Technical Methods Report: What to Do When Data Are Missing in Group Randomized Controlled Trials

NCEE 2009-0049
October 2009

Appendix A: Missing Data Bias as a Form of Omitted Variable Bias

One way to better understand the missing data problem is to see how it is related to the very first type of bias to which most of us were introduced in our first regression class, omitted variable bias. Suppose the true model of impacts is shown in equation (1):

(1) $Y = \beta_0 + \beta_1 X + \beta_2 \mathrm{Trt} + \varepsilon$, where $\varepsilon \sim N(0, \sigma_1 I)$

where X is the baseline covariate and Trt is the treatment group indicator. However, suppose that the researchers conducting the RCT estimate a simpler model that excludes the baseline variable, as shown in equation (2):

(2) $Y = \beta_0 + \beta_1 \mathrm{Trt} + v$, where $v \sim N(0, \sigma_2 I)$

The decision to estimate equation (2) instead of (1) may be driven by lack of knowledge (the researchers may not realize that the baseline variable affects the outcome) or by necessity (the variable may be inherently unobservable). Alternatively, the decision may rest on the belief that “simpler is better” in RCTs, since without missing data problems, RCTs yield unbiased impact estimates even when no control variables are included.

Now suppose some of the data are missing. More specifically, suppose that the outcome variable (Y) is missing for some cases, and the researchers plan to drop cases with missing values. (Other approaches to missing data are considered in the body of the report, but dropping cases with missing values may be the best tool for illustrating the consequences of missing data.) In this very common scenario, will the researchers obtain unbiased estimates of the treatment effect (β2)? The answer is “it depends” or “only in special cases.” If the observations with non-missing values are a simple random sample of the larger sample (missing completely at random, MCAR), the answer is yes, and the only consequence is a smaller sample and less statistical power. If the observations with non-missing values are a random sample conditional on the independent variables in the model (missing at random, MAR), the answer is still yes.

What does this mean for our simple example? It means the RCT can obtain unbiased impact estimates if (a) the data are missing completely at random (just a coin toss or roll of the dice) or (b) the data are missing at random within each group defined by the only covariate included in the model: the treatment indicator. Exactly how this can be achieved is a core portion of the remainder of this appendix. Scenario (b) warrants some additional consideration, since a difference in response rates between the treatment and control groups might be taken as a sign that the impact estimates are biased. However, as long as the process behind the missing data is completely random within each group, it does not matter if the percentage of cases with missing data differs between the two groups: the estimated treatment effect will still be unbiased.
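The claim that unequal response rates are harmless when missingness is random within each group can be checked with a small simulation. The sketch below is illustrative only: the parameter values, the 30 percent share for the baseline covariate, and the response rates (90 percent for controls, 60 percent for the treatment group) are hypothetical and do not come from the report. It generates outcomes from equation (1), drops cases with a probability that depends only on Trt, and then estimates the impact as in equation (2), that is, as a difference in respondent means.

```python
import random

def avg(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def simulate_dropping_cases(n=200_000, resp_control=0.9, resp_treat=0.6, seed=0):
    """Difference in respondent means when response depends only on Trt."""
    rng = random.Random(seed)
    # Hypothetical parameters for equation (1); the true impact is b2 = 5.
    b0, b1, b2, sd = 50.0, -10.0, 5.0, 10.0
    observed = {0: [], 1: []}
    for _ in range(n):
        x = 1 if rng.random() < 0.3 else 0    # baseline covariate omitted from equation (2)
        trt = 1 if rng.random() < 0.5 else 0  # random assignment
        y = b0 + b1 * x + b2 * trt + rng.gauss(0.0, sd)
        # The outcome is observed with a probability that depends on Trt only,
        # so missingness is completely random within each group.
        p_observed = resp_treat if trt else resp_control
        if rng.random() < p_observed:
            observed[trt].append(y)
    return avg(observed[1]) - avg(observed[0])

estimate = simulate_dropping_cases()
print(round(estimate, 2))
```

Under these assumptions the printed estimate should land close to the true impact of 5, even though the treatment group loses four times as many cases as the control group.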

However, there are still two potential pitfalls that could lead to biased estimates of the average impact of the treatment (both fall under the NMAR case). First, even where the occurrence of “missingness” is unrelated to treatment status, it can be related to other variables that have been omitted from the model (as X has been omitted from equation (2)) and cause bias. This is a case where missingness causes the observed treatment and control group outcome samples to be “equally unrepresentative” of the population of interest (i.e., the population these samples would represent if the outcome data were complete). For example, suppose the outcome variable in equation (2) is the student's score on the state assessment in reading, and the observed baseline variable excluded from the model (X in equation (1)) equals 1 for Limited English Proficiency (LEP) students and 0 for non-LEP students. If the missing data rate is larger for LEP students than for other students, and larger by the same amount in the treatment group and the control group samples, then the analysis sample of students (that is, the students with non-missing data) would be skewed toward non-LEP students in both the treatment group and the control group.

So when is this a problem? It is a problem when the impact of the treatment differs between LEP students and other students. For example, suppose the impact of the program on reading achievement is larger for LEP students than for other students. If LEP students are underrepresented in the analysis sample due to missing data, this will pull the estimated impacts downward. In this example, and many like it, random assignment will provide an unbiased estimate of the treatment's average impact for students with nonmissing data. However, because missing data have skewed both the treatment and control samples toward non-LEP students, for whom the impacts are relatively small, equation (2) will yield a downwardly biased estimate of the treatment's average impact for students in the broader study sample (and for whatever population this sample was designed to represent).
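A short worked calculation may help fix ideas for this first pitfall. All numbers below are hypothetical and chosen only for illustration: 30 percent of the study sample is LEP, the impact is 8 points for LEP students and 4 points for other students, and the response rate is 50 percent for LEP students and 90 percent for other students in both the treatment and control groups. The function names are likewise made up for this sketch.

```python
def average_impact(share_lep, impact_lep, impact_other):
    """Average impact for a sample with the given share of LEP students."""
    return share_lep * impact_lep + (1 - share_lep) * impact_other

def lep_share_among_respondents(share_lep, resp_lep, resp_other):
    """Share of LEP students among cases with non-missing outcomes."""
    responding_lep = share_lep * resp_lep
    responding_other = (1 - share_lep) * resp_other
    return responding_lep / (responding_lep + responding_other)

# Hypothetical inputs (not taken from the report).
share_lep = 0.30
impact_lep, impact_other = 8.0, 4.0
resp_lep, resp_other = 0.50, 0.90  # same rates in both experimental groups

full_sample_impact = average_impact(share_lep, impact_lep, impact_other)
analysis_share_lep = lep_share_among_respondents(share_lep, resp_lep, resp_other)
analysis_sample_impact = average_impact(analysis_share_lep, impact_lep, impact_other)

print(f"LEP share in analysis sample: {analysis_share_lep:.3f}")
print(f"Average impact, full study sample: {full_sample_impact:.2f}")
print(f"Average impact, analysis sample:   {analysis_sample_impact:.2f}")
```

With these inputs the LEP share falls from 30 percent in the full sample to roughly 19 percent among respondents, so the average impact that the analysis sample can recover is smaller than the average impact for the full study sample. If the two subgroup impacts were equal, the two averages would coincide and the skew would not matter.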

The second potential pitfall arises if missing data are related to both treatment status and a variable that has been omitted from the model. In this context, the analysis sample in both groups (treatment and control) will be unrepresentative of the broader population of students. However, because missing data is related to both treatment status and the omitted variable, the analysis samples in the treatment group and the control group will not be “equally unrepresentative,” i.e., the treatment and control samples will be “differentially skewed” toward non-LEP students. While the first pitfall yields unbiased impact estimates for the wrong population, this pitfall yields biased impact estimates for the wrong population. In both instances, the wrong population is being studied in relation to the information policy makers need about the full set of students potentially impacted by an intervention.

To gain a better understanding of the second pitfall, let us build on the example developed in this section. The treatment and control samples used in the analysis could be “differentially skewed” toward non-LEP students if the treatment itself has a positive effect on English proficiency, and LEP students with higher English proficiency are more likely to be required to take the state test used to create the outcome variable for the analysis. In this scenario, within the analysis sample, the treatment group would be less skewed toward non-LEP students than the control group.

Mathematically, this introduces omitted variable bias by creating a positive correlation between treatment status (Trt in equation (2)) and the omitted variable (X or LEP status in equation (1)) in the observed sample [78]. Among students with complete data, LEP students are more likely to be in the treatment group than in the control group. If LEP status has a negative effect on the outcome (reading achievement, as measured by the state test), the positive correlation between the treatment and LEP status among students in the analysis sample will produce a negative bias in the impact estimate. Put differently, in this scenario, the RCT will understate the true impact of the treatment. More generally, when there are a variety of omitted variables that are related to both the outcome and its missing data pattern, the bias due to missing data could be positive or negative.
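The direction of this bias can be made explicit with a short derivation. The sketch below assumes that equation (1) holds with a constant treatment effect, introduces a hypothetical response indicator R (R = 1 when the outcome is observed; this symbol does not appear in the report), and assumes that the error term in equation (1) has mean zero among respondents once X and Trt are given. Taking the difference in expected outcomes between treatment and control respondents then gives:

```latex
\begin{aligned}
E[Y \mid \mathrm{Trt}=1, R=1] - E[Y \mid \mathrm{Trt}=0, R=1]
  &= \beta_2 + \beta_1 \bigl( E[X \mid \mathrm{Trt}=1, R=1] - E[X \mid \mathrm{Trt}=0, R=1] \bigr)
\end{aligned}
```

The second term is the omitted variable bias. In the LEP example, $\beta_1 < 0$ and the share of LEP students among respondents is higher in the treatment group, so the term in parentheses is positive and the product is negative. If the missing data pattern does not depend on X within either group, the term in parentheses is zero and the difference in respondent means equals $\beta_2$.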

There are two major lessons that can be gleaned from this discussion:

  • The situations under which we can obtain unbiased impact estimates if we simply exclude students with missing data are very restrictive: Excluded students must be a random sample of students conditional on the independent variables included in the model, including the treatment indicator. Furthermore, the scenarios under which the missing data process is more complicated are quite plausible in many settings.
  • When missing data bias is considered, covariates play a more important role in RCTs than is commonly believed. Because bias due to missing data can be thought of as omitted variable bias, one approach to the problem is clear: the regressions used to estimate impacts should include variables that may influence both the outcome and the probability of having missing data on outcomes. While this is not the only approach to addressing missing data, it may be the simplest and most straightforward. Therefore, while covariates only serve to improve precision in RCTs when data are complete, in the real world, where data are never complete, covariates can help to reduce bias due to missing data.
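The second lesson can be illustrated with a simulation of the second pitfall. This is a minimal sketch under stated assumptions, not the report's own analysis: outcomes follow equation (1) with a constant treatment effect, all parameter values and response rates are hypothetical, and the covariate-adjusted estimate is computed by partialling the binary covariate out of both the outcome and the treatment indicator, which reproduces the ordinary least squares coefficient on Trt from a regression of Y on X and Trt. LEP students respond at a higher rate in the treatment group (80 percent) than in the control group (40 percent), while other students respond at 90 percent in both groups.

```python
import random

def avg(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def simulate_respondents(n=200_000, seed=1):
    """Return (x, trt, y) rows for students whose outcome is observed."""
    rng = random.Random(seed)
    b0, b1, b2, sd = 50.0, -10.0, 5.0, 10.0   # hypothetical; true impact is 5
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.3 else 0     # LEP indicator
        trt = 1 if rng.random() < 0.5 else 0   # random assignment
        y = b0 + b1 * x + b2 * trt + rng.gauss(0.0, sd)
        # Response depends on both LEP status and treatment status.
        if x == 0:
            p_observed = 0.9
        else:
            p_observed = 0.8 if trt else 0.4
        if rng.random() < p_observed:
            rows.append((x, trt, y))
    return rows

def unadjusted_impact(rows):
    """Equation (2): difference in respondent means, ignoring X."""
    treated = [y for _, trt, y in rows if trt == 1]
    control = [y for _, trt, y in rows if trt == 0]
    return avg(treated) - avg(control)

def adjusted_impact(rows):
    """Coefficient on Trt from regressing Y on X and Trt, via partialling out X."""
    trt_resid, y_resid = [], []
    for level in (0, 1):
        cell = [(trt, y) for x, trt, y in rows if x == level]
        trt_bar = avg([trt for trt, _ in cell])
        y_bar = avg([y for _, y in cell])
        trt_resid += [trt - trt_bar for trt, _ in cell]
        y_resid += [y - y_bar for _, y in cell]
    numerator = sum(t * y for t, y in zip(trt_resid, y_resid))
    denominator = sum(t * t for t in trt_resid)
    return numerator / denominator

rows = simulate_respondents()
unadjusted_estimate = unadjusted_impact(rows)
adjusted_estimate = adjusted_impact(rows)
print(f"Unadjusted estimate: {unadjusted_estimate:.2f}")
print(f"Adjusted for LEP:    {adjusted_estimate:.2f}")
```

Under these assumptions the unadjusted estimate should fall noticeably below the true impact of 5, while the estimate that controls for LEP status should be close to it. The adjustment works here because LEP status is the only variable driving both the outcome and the missing data pattern; with unobserved drivers of missingness, some bias would remain.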

[78] A non-zero correlation between the treatment/control group status indicator and an X variable measured prior to random assignment can never occur for the sample as a whole in an expected value sense, since treatment status is generated subsequent to that measurement and bears no relationship to anything else (having emerged from a random number generator or flip of a coin).