This post originally appeared on The Timmerman Report. You should check out the TR.
It’s baseball season. Which means it’s fantasy baseball season. Which means I have to keep reminding myself that, even though it’s already been a month and a half, that’s still a pretty short time in the long rhythm of the season and every performance has to be viewed with skepticism. Ryan Zimmerman sporting a 0.293 On Base Percentage (OBP)? He’s not likely to end up there. On the other hand, Jake Odorizzi with an Earned Run Average (ERA) less than 2.10? He’s good, but not that good. I try to avoid making trades in the first few months (although with several players on my team on the Disabled List, I may have to break my own rule) because I know that in small samples, big fluctuations in statistical performance in the end are not really telling us much about actual player talent.
One of the big lessons I’ve learned from following baseball and the revolution in sports analytics is that one of the most powerful forces in player performance is regression to the mean. This is the tendency for most outliers, over the course of repeated measurements, to move toward the mean of both individual and population-wide performance levels. There’s nothing magical, just simple statistical truth.
And as I lift my head up from ESPN sports and look around, I’ve started to wonder if regression to the mean might be affecting another interest of mine, and not for the better. I wonder if a lack of understanding of regression to the mean might be a problem in our search for ways to reach better health.
Like so many things, the problem arises from great intentions. The gold standard for testing interventions in health, whether they’re new drugs, devices, or new treatment methodologies, is the random, double-blind placebo controlled trial. A cohort of individuals who are suffering from a disease gets assigned randomly into one of two groups–treated with the real drug or a placebo–and everyone, clinicians and patients, is blinded as to who is in which group. But even designs as robust as this can have flaws.
I’ve been involved in a number of trials in which the choice of participants has been guided by specific criteria of disease activity. This might be called the Disease Activity Score (DAS) or the Expanded Disability Status Scale (EDSS) or something similar and is a sum of different clinical measures. In multiple sclerosis (MS) for example, EDSS can include how impaired a patient is across a number of functional system scores. Think of it as a composite rate stat, like On Base Percentage + Slugging (OPS).
Here’s where the influence of regression to the mean comes in. Patients are often enrolled in trials when they are showing high disease scores. Multiple sclerosis, like many autoimmune diseases, tends to flare up and calm down over time. Sometimes patients feel OK, other times they feel like crap. The problem with randomized trials is that patients are most likely to be enrolled when their disease is flaring. This is for the practical reason that researchers are running the study because they want to know if a drug or intervention has a clear effect, and it’s easier to observe an effect when you’re starting with a patient that presents as far more sick than average.
And yet that’s also the exact situation where regression to the mean comes in. Are we selecting patients for trials who are more likely to get better just due to regression? And is that confounding our data and interpretation? Maybe this wouldn’t be a problem if disease symptoms were uniform and proceeded in a linear way, always getting worse. But that’s not the case.
Now, the placebo controlled arm is supposed to control for things like regression to the mean and maybe in many cases it’s enough. But the numbers of patients participating in trials, especially in Phase 1 and Phase 2, tend to be small. Random sampling taken as a cross section of a diverse population at a moment in time could yield very non-representative behavior. For example, right now if I took the 10 top hitters based on batting average there would be some familiar names and some surprises. I mean, Dee Gordon, a career .286 hitter, will not sustain the highest batting average in the history of the modern baseball era. A random shuffling of these hitters into two groups could easily lead to one group outperforming the other over the next five months, just by chance, specifically because we’re selecting hitters based on their performance at a single moment in time (in analogy to peak disease score) rather than by their accumulated performance data over the past several years.
Table stats as of May 11, 2015.
And this concept of skewed sampling holds for clinical studies. A recent paper by Ariel Linden modeled the regression to the mean effect and found that standard calculations of the regression effect could severely underestimate the contribution of regression to outcome data when the initial populations are skewed, as might happen when taking relatively small samples.
So how to try and overcome that? Clinical researchers have been trying hard to find elements of disease and biology that indicate true disease. In baseball, researchers have identified elements such as O-swing and Z-swing percentage, and average velocity to get a handle on player ability outside of in-game performance (which is highly variable and subject to factors outside the player’s control, like cold weather early in the season). In clinical research, we often look at fairly subjective outcomes, like the degree of fatigue a multiple sclerosis patient’s reported feeling. To dig deeper, at a molecular level of disease, we look at biomarkers which we hope are more reliable and objective. Only the biomarkers are often not very consistently correlated with an observable clinical outcome (like the patient feeling better) or they aren’t very predictive of whether a patient is heading in a positive or negative direction. Replication of biomarkers in different studies and different labs is often difficult.
Another way, which unfortunately would do nothing to speed up drug development and would cost more besides, would be to track patients for a long time and develop an individual picture of each patient’s disease course and behavior. The idea is to accumulate that kind of long-term data that allows an understanding of each patient’s baseline and then apply that knowledge in evaluating whether a drug or other treatment has had a real effect. I’ve banged the drum on this idea before, but I admit I don’t know how long it would take. In multiple sclerosis, for example, it might take 2-3 years for a given therapy to be adequately tested against the various ups and downs the patient is bound to experience. The overall time needed is probably longer than today’s typical randomized trial. Probably, just like stabilization rates in baseball, the amount of time and specific metrics to track would differ for each disease.
Maybe this is an opportunity for our patient-focused advocacy groups to proactively marshal their populations to accumulate this data, so that when their members are selected for trials they can appear, data in hand, to enrich a study design. Hell, it could be a great incentive for commercial interests to do work in a given disease area, knowing they don’t need to build this kind of resource from scratch.
Regression to the mean is in no way the only factor contributing to the difficulty of getting a drug or intervention approved. Safety concerns, real efficacy, business models, the immense difficulty of trying to engineer complex systems like human beings and other things play huge roles. Still, I think it’s useful to consider everything we can because, like Major League Baseball today, we need to find every edge to hope to have a chance of success.