A. Introduction
Most statistics textbooks provide lengthy discussions of the theory of probability,
descriptive statistics, hypothesis testing, and a range of simple to more complex
statistical methods. To illustrate these discussions, the authors often present
examples with real or fictional data—tidy tables of observations and variables
with values for each cell. Although this may be entirely appropriate to illustrate
statistical methods, anyone who does "real world" research knows that data are rarely,
if ever, so complete. Some study participants may be unavailable for data collection,
refuse to provide data, or be asked a question which is not applicable to their
circumstances. Whatever the mechanism that causes the data to be missing, it is
a common problem in almost all research studies.
This report is designed to provide practical guidance on how to address the problem of missing data in the analysis of data from Randomized Controlled Trials (RCTs) of educational interventions, with a particular focus on the common educational situation in which groups of students such as entire classrooms or schools are randomized (called Group Randomized Trials, GRTs). The need for such guidance is of growing importance as the number of educational RCTs has increased in recent years. For example, the ten Regional Educational Laboratories (RELs) sponsored by the Institute of Education Sciences (IES) of the U.S. Department of Education are currently conducting 25 RCTs to measure the effectiveness of different educational interventions,1 and IES has sponsored 23 impact evaluations that randomized students, schools, or teachers since it was established in 2002.2
This report is divided into four chapters. Following a brief overview of the missing data problem, this first chapter provides our overall guidance for educational researchers based on the results of extensive data simulations that were done to assess the relative performance of selected missing data strategies within the context of the types of RCTs that have been conducted in education. Chapter 2 sets the stage for a discussion of specific missing data strategies by providing a brief overview of the design of RCTs in education, the types of data used in impact analysis, how these data can be missing, and the analytical implications of missing data. Chapter 3 describes a selection of methods available for addressing missing data, and Chapter 4 describes the simulation methodology and the statistical results that support the recommendations presented in this chapter. Appendices provide additional details on the simulation methods and the statistical results.
B. Missing Data and Randomized Trials
The purpose of a randomized controlled trial (RCT) is to allow researchers to draw
causal conclusions about the effect, or "impact," of a particular policy-relevant
intervention
(U.S. Department of Education, 2009). For example, if we wanted to know how students do when they are taught with a particular reading or math curriculum, we could obtain test scores before and after they are exposed to the new mode of instruction to see how much they learned. But to determine whether this intervention caused the observed student outcomes, we need to know how these same students would have done had they not received the treatment.3 Of course, we cannot observe the same individuals in two places at the same time. Consequently, the RCT creates equivalent groups by randomly assigning eligible study participants either to a treatment group, which receives the intervention under consideration, or to a control group, which does not receive the particular treatment but often continues with "business as usual," e.g., the mode of instruction that would be used in the absence of a new math curriculum.4 5 Because of the hierarchical way in which schools are organized, most education RCTs randomize groups of students—entire schools or classrooms—rather than individual children to study conditions. In these GRTs, the treatment is typically delivered at the group or cluster level, but the primary research interest is the impact of the selected treatment on student outcomes, although it is also not uncommon to look for intermediate impacts on teachers or schools.
The advantage of the RCT is that if random assignment is properly implemented with a sufficient sample size, treatment group members will not differ in any systematic or unmeasured way from control group members except through their access to the intervention being studied (the groups are equivalent both on observable and unobservable characteristics). It is this elimination of confounding factors that allows us to make unbiased causal statements about the effect of a particular educational program or intervention by contrasting outcomes between the two groups.
However, for an RCT to produce unbiased impact estimates, the treatment and control groups must be equivalent in their composition (in expectation) not just at the point of randomization (referred to as the "baseline" or pretest point), but also at the point where follow-up or outcome data are collected. Missing outcome data are a problem for two reasons: (1) the loss of sample members can reduce the power to detect statistically significant differences, and (2) the introduction of non-random differences between the treatment and control groups can lead to bias in the estimate of the intervention's effect. The seriousness of the potential bias is related to the overall magnitude of the missing data rate, and the extent to which the likelihood of missing data differs between the treatment and control groups. For example, according to the What Works Clearinghouse6 the bias associated with an overall attrition rate of ten percent and a differential treatment-control group difference in attrition rates of five percent can be equal to the bias associated with an overall attrition rate of 30 percent and a differential attrition rate of just two percent.
Therefore, in a perfect world, the impact analysis conducted for an RCT in education would include outcomes for all eligible study participants defined at the time of randomization. However, this ideal is rarely, if ever, attained. For example, individual student test scores can be completely missing because of absenteeism, school transfer, or parental refusal for testing. In addition, a particular piece of information can be missing because respondents refuse to answer a certain test item or survey question, are unable to provide the requested information, inadvertently skip a question or test item, or provide an unintelligible answer. In an education RCT, we also have to concern ourselves with missing data at the level of entire schools or classrooms if randomly assigned schools or classrooms either opt out of the study completely or do not allow the collection of any outcome data.
As demonstrated by Rubin (1976, 1987), the process through which missing data arise can have important analytical implications. In its most innocuous form—a category that Rubin calls Missing Completely at Random (MCAR)—the mechanism that generates missing data is a truly random process unrelated to any measured or unmeasured characteristic of the study participants. A second category—Missing at Random (MAR)—is one in which missingness is random conditional on the observed characteristics of the study sample. For example, the missing data would be MAR if missingness on the post-test score were related to gender, but conditional on gender—that is, among boys or among girls—the probability of missing data is the same for all students. Typically, if one can reasonably assume that missing data arise under either the conditions of MCAR or MAR, the missing data problem can be considered "ignorable," i.e., the factors that cause missingness are unrelated, or weakly related, to the estimated intervention effect. In some situations, however, one cannot reasonably assume such ignorability, a category that Rubin calls Not Missing at Random (NMAR).7
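The distinction between MCAR and MAR can be made concrete with a short simulation. The sketch below is our own illustration of the gender example above (the data and rates are hypothetical, not drawn from the report's simulations): it deletes post-test scores under each mechanism and compares missing-data rates.

```python
import random

random.seed(0)

# Hypothetical toy data: post-test scores for 500 boys and 500 girls.
students = [{"gender": g, "score": random.gauss(50, 10)}
            for g in ("boy", "girl") for _ in range(500)]

# MCAR: every student has the same 20% chance of a missing post-test,
# regardless of any measured or unmeasured characteristic.
mcar = [dict(s, score=None) if random.random() < 0.20 else dict(s)
        for s in students]

# MAR: missingness depends on gender (30% for boys, 10% for girls), but
# conditional on gender the probability is the same for every student.
mar = [dict(s, score=None)
       if random.random() < (0.30 if s["gender"] == "boy" else 0.10)
       else dict(s)
       for s in students]

def missing_rate(rows, gender=None):
    """Share of rows with a missing score, optionally within one gender."""
    rows = [r for r in rows if gender is None or r["gender"] == gender]
    return sum(r["score"] is None for r in rows) / len(rows)
```

Under MCAR the missing-data rate is roughly 20 percent for boys and girls alike; under MAR it is roughly 30 percent for boys and 10 percent for girls, yet within each gender missingness is purely random.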
Within the context of an RCT, if the missing data mechanism differs between the treatment and control groups, dropping cases with missing data may lead to systematic differences between the experimental groups which can lead to biased impact estimates. Furthermore, even if the missing data mechanism is the same for the treatment and control groups, we may still be concerned if certain types of teachers or students are under- or over-represented in the analysis sample. For example, if the impacts for underrepresented groups are higher than the average impact, then the impact estimates will be biased downward; if the impacts for underrepresented groups are lower than the average impact, the impact estimates will be biased upward.
C. Missing Data Methods
As noted above, missing data is a common problem in educational evaluations. For
example, in impact evaluations funded by the National Center for Educational Evaluation
and Regional Assistance (NCEE), student achievement outcomes are often missing for
10-20 percent of the students in the sample (Bernstein,
et al., 2009; Campuzano, et al., 2009;
Constantine, et al., 2009;
Corrin, et al., 2008; Gamse, et al., 2009;
Garet, et al., 2008; and
Wolf, et al., 2009).
Strategies used to address missing data in education RCTs range from simple methods like listwise deletion (e.g., Corrin, et al., 2009), to more sophisticated approaches like multiple imputation (e.g., Campuzano, et al., 2009). In addition, some studies use different approaches to addressing missing covariates and missing outcomes, such as imputing missing covariates but re-weighting complete cases to address missing outcomes (e.g., Wolf, et al., 2009). Despite the prevalence of the missing data challenge, there is no consensus on which methods should be used and the circumstances under which they should be employed. This lack of common standards is not unique to education research, and even areas where experimental research has a long history, like medicine, are still struggling with this issue. For example, guidance from the Food and Drug Administration (FDA) on the issue of missing data in clinical trials indicates that "A variety of statistical strategies have been proposed in the literature…(but) no single method is generally accepted as preferred" (FDA, 2006, p.29).
The selection of the methods that are the focus of this report was based on a review of several recent articles by statistical experts seeking to provide practical guidance to applied researchers (Graham, 2009; Schafer & Graham, 2002; Allison, 2002; and Peugh & Enders, 2004). Specifically, this report examines the following analysis strategies for dealing with the two types of missing data that are of primary importance when analyzing data from an educational RCT (missing outcome or post-test data and missing baseline or pretest data):
In the discussions that follow, we intentionally include methods that are commonly criticized in the literature—listwise deletion and simple mean value imputation—for two reasons. First, the use of those methods is widespread in education. For example, a review of 545 published education studies by Peugh & Enders (2004) showed a nearly exclusive reliance on deletion as a way to deal with missing data. Second, because we are focusing on RCTs, we wanted to understand how different missing data strategies performed within this unique context, including commonly used but criticized methods.
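For readers unfamiliar with the two commonly criticized methods just mentioned, a minimal sketch (using hypothetical toy data of our own, with None marking a missing pretest) shows how little machinery each requires:

```python
# Hypothetical (pretest, post-test) records; None marks a missing pretest.
records = [(None, 72.0), (48.0, 70.0), (52.0, 75.0), (None, 68.0), (50.0, 71.0)]

# Listwise deletion: drop every record with any missing value.
complete_cases = [r for r in records if None not in r]

# Simple mean value imputation: replace each missing pretest with the
# mean of the observed pretests.
observed = [pre for pre, _ in records if pre is not None]
pre_mean = sum(observed) / len(observed)
imputed = [(pre if pre is not None else pre_mean, post)
           for pre, post in records]
```

Listwise deletion shrinks the analysis sample (here from five records to three), while mean imputation keeps the full sample but assigns every missing pretest the same value, understating the variability in the data.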
D. Guidance to Researchers
Basis for the recommendations
Our recommendations for dealing with missing data in Group Randomized Trials in
education are based on the results of extensive statistical simulations of a typical
educational RCT in which schools are randomized to treatment conditions. As discussed
in Chapter 4, selected missing data methods were examined under conditions that
varied on three dimensions: (1) the amount of missing data, relatively low (5%
missing) vs. relatively high (40% missing); (2) the level at which data are missing—at
the level of whole schools (the assumed unit of randomization) or for students within
schools; and, (3) the underlying missing data mechanisms discussed above (i.e. MCAR,
MAR and NMAR).
The performance of the selected missing data methods was assessed on the basis of the bias that was found in both the estimated impact and the associated estimated standard error, using a set of standards that were developed from guidance currently in use by the U.S. Department of Education's What Works Clearinghouse (see Chapter 4 and Appendix E).
The recommendations that are provided below are based on the following criteria:
Recommendations
Missing Pretest Scores Or Other Covariates
When pretest scores or other covariates are missing for students within schools in studies that randomize schools, the simulation results lead us to recommend the use of the following missing data methods:
In this context, we would not recommend the use of three methods that produced impact estimates with bias that exceeded 0.05 in one or more of our simulations: case deletion, mean value imputation, and single non-stochastic regression imputation.
Alternatively, when data on baseline variables are missing for entire schools, the simulation results lead us to recommend the use of the following methods:10
We would not recommend the use of two methods that produced standard error estimates with bias in at least one of our simulations that exceeded the WWC-based threshold: single non-stochastic regression imputation and single stochastic regression imputation.
Across the two scenarios—i.e., situations when pretest or covariate data are missing either for students within schools or for entire schools—three methods were consistently recommended and are likely to be the best choices:
It is important to stress that these recommendations are specific to situations in which the intent is to make inferences about the coefficient on the treatment indicator in a group randomized trial. If, for example, an analyst wanted to make inference about the relationship between pretest and post-test scores, and there were missing values on the pretest scores, we would not recommend use of the dummy variable approach. With this method, the estimate of the coefficient on the pretest score is likely to be biased, as has been described in previous literature. But when the interest is on the estimate of the treatment effect, the dummy variable approach yields bias in the coefficient of interest—the estimated treatment effect—that falls within the acceptable range as we defined it for these simulations, and is similar in magnitude to the biases obtained from the more sophisticated methods.
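The dummy-variable approach described above can be sketched in a few lines. The simulation below is our own illustration (hypothetical parameter values, with a true impact of 2.0, not the report's simulation design): missing pretests are filled with the observed mean, and an indicator for missingness is added to the impact regression.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical simulated data: true treatment impact is 2.0, and 25% of
# pretest scores go missing completely at random.
n = 2000
treat = rng.integers(0, 2, n)                        # treatment indicator
pre = rng.normal(50.0, 10.0, n)                      # pretest scores
post = 2.0 * treat + 0.8 * pre + rng.normal(0.0, 5.0, n)

miss = rng.random(n) < 0.25                          # which pretests are missing
pre_filled = np.where(miss, pre[~miss].mean(), pre)  # fill with observed mean

# Dummy-variable approach: regress the post-test on an intercept, the
# treatment indicator, the filled-in pretest, and a missingness indicator.
X = np.column_stack([np.ones(n), treat, pre_filled, miss.astype(float)])
coef, *_ = np.linalg.lstsq(X, post, rcond=None)
impact_estimate = coef[1]    # coefficient on the treatment indicator
```

Consistent with the discussion above, the coefficient on the filled-in pretest is not a trustworthy estimate of the pretest/post-test relationship, but the coefficient on the treatment indicator remains close to the true impact.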
Missing Post-Test Scores Or Other Outcome Variables
When data on outcome variables are missing for students within schools in studies
that randomize schools, the simulation results lead us to recommend the use of the
following methods:
We would not recommend using mean value imputation because it was the only method that produced impact estimates with bias that exceeded 0.05 under the MAR scenario.
When data on dependent variables are missing for entire schools, the simulation results lead us to recommend the use of the following methods:
We would not recommend the use of the three methods that produced standard error estimates with bias that exceeded the WWC-based threshold: mean value imputation, single non-stochastic regression imputation, and single stochastic regression imputation.
Across the two scenarios—i.e., situations when post-test or outcome data are missing either for students within schools or for entire schools—five methods were consistently recommended and are likely to be the best choices:
In addition, we recommend that if post-test scores are missing for a high fraction of students or schools (e.g., 40%), analysts should control for pretest scores in the impact model if possible. In our simulations, controlling for pretest scores by using them as regression covariates reduced the bias in the impact estimate by approximately 50 percent, and this finding was robust across different scenarios and different methods.11
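The value of controlling for pretest scores can be illustrated with a small simulation of our own (hypothetical parameter values, not the report's design). When post-test missingness depends on the pretest differentially by treatment arm, the unadjusted complete-case contrast is biased, while a complete-case regression that includes the pretest as a covariate recovers the true impact.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: true treatment impact is 2.0.
n = 4000
treat = rng.integers(0, 2, n)
pre = rng.normal(50.0, 10.0, n)
post = 2.0 * treat + 0.8 * pre + rng.normal(0.0, 5.0, n)

# Differential missingness: in the treatment group, low-pretest students
# are much more likely to be missing their post-test.
p_miss = np.where((treat == 1) & (pre < 50), 0.6, 0.1)
obs = rng.random(n) > p_miss      # True = post-test observed

# Complete-case contrast WITHOUT the pretest: biased upward, because low
# scorers drop out of the treatment group.
naive = post[obs & (treat == 1)].mean() - post[obs & (treat == 0)].mean()

# Complete-case regression WITH the pretest as a covariate: missingness is
# random conditional on the covariates, so the treatment coefficient is
# close to the true impact.
X = np.column_stack([np.ones(obs.sum()), treat[obs], pre[obs]])
coef, *_ = np.linalg.lstsq(X, post[obs], rcond=None)
adjusted = coef[1]
```

In this toy setting the unadjusted contrast overstates the impact by more than a point, while the pretest-adjusted estimate lands near the true value of 2.0.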
As a final note, the recommendations provided above indicate that some methods that are easy to implement performed similarly to more sophisticated methods. In particular, where pretest scores were missing for either students or entire schools, the dummy variable approach performed similarly to the more sophisticated approaches and was among our recommended methods. And when post-test scores were missing for either students or entire schools, case deletion was among our recommended approaches. Consequently, we suggest that analysts take the ease with which missing data can be handled, and the transparency of the methods to a policy audience, into consideration when making a final choice of an appropriate method.
Other Suggestions For Researchers
In addition to the recommendations that we derive from the simulation results presented
in this report, we also recommend that all researchers adhere to the following general
analysis procedures when conducting any education RCT:
Final Caveats
Readers are cautioned to keep in mind that these simulation results are specific
to a particular type of evaluation—an RCT in which schools are randomized to experimental
conditions. Whether the key findings from these simulations would apply in RCTs
that randomize students instead of schools is an open question that we have not
addressed in this report. In addition, it is not clear whether the findings from
our simulations would be sensitive to changes in the key parameter values that we
set in specifying the data generating process and the missing data mechanisms. Finally,
we could not test all possible methods to address missing data. These limitations
notwithstanding, we believe these simulations yield important results that can help
inform decisions that researchers need to make when they face missing data in conducting
education RCTs.
Finally, despite all of the insights gained from the simulations, we cannot propose a fully specified and empirically justified decision rule about which methods to use and when. Too often, the best method depends on the purpose and design of the particular study and the underlying missing data mechanism that cannot, to the best of our knowledge, be uncovered from the data.