- 1. Overview and Guidance
- 2. Randomized Controlled Trials (RCTs) in Education and the Problem of Missing Data
- 3. Selected Techniques for Addressing Missing Data in RCT Impact Analysis
- 4. Testing the Performance of Selected Missing Data Methods
- References
- Exhibits
- Appendix A: Missing Data Bias as a Form of Omitted Variable Bias
- Appendix B: Resources for Using Multiple Imputation
- Appendix C: Specifications for Missing Data Simulations
- Appendix D: Full Set of Simulation Results
- Appendix E: Standards for Judging the Magnitude of the Bias for Different Missing Data Methods

**A. Introduction**

Most statistics textbooks provide lengthy discussions of the theory of probability,
descriptive statistics, hypothesis testing, and a range of simple to more complex
statistical methods. To illustrate these discussions, the authors often present
examples with real or fictional data—tidy tables of observations and variables
with values for each cell. Although this may be entirely appropriate to illustrate
statistical methods, anyone who does "real world" research knows that data are rarely,
if ever, so complete. Some study participants may be unavailable for data collection,
refuse to provide data, or be asked a question which is not applicable to their
circumstances. Whatever the mechanism that causes the data to be missing, it is
a common problem in almost all research studies.

This report is designed to provide practical guidance on how to address the problem
of missing data *in the analysis of data in Randomized Controlled Trials (RCTs)
of educational interventions*, with a particular focus on the common educational
situation in which groups of students such as entire classrooms or schools are randomized
(called Group Randomized Trials, GRTs). The need for such guidance is of growing
importance as the number of educational RCTs has increased in recent years. For
example, the ten Regional Educational Laboratories (RELs) sponsored by the Institute
of Education Sciences (IES) of the U.S. Department of Education are currently conducting
25 RCTs to measure the effectiveness of different educational interventions,^{1} and IES has sponsored 23 impact evaluations
that randomized students, schools, or teachers since it was established in 2002.^{2}

This report is divided into four chapters. Following a brief overview of the missing data problem, this first chapter provides our overall guidance for educational researchers based on the results of extensive data simulations that were done to assess the relative performance of selected missing data strategies within the context of the types of RCTs that have been conducted in education. Chapter 2 sets the stage for a discussion of specific missing data strategies by providing a brief overview of the design of RCTs in education, the types of data used in impact analysis, how these data can be missing, and the analytical implications of missing data. Chapter 3 describes a selection of methods available for addressing missing data, and Chapter 4 describes the simulation methodology and the statistical results that support the recommendations presented in this chapter. Appendices provide additional details on the simulation methods and the statistical results.

**B. Missing Data and Randomized Trials**

The purpose of a randomized controlled trial (RCT) is to allow researchers to draw
causal conclusions about the effect, or "impact," of a particular policy-relevant
intervention (U.S. Department of Education, 2009). For example, if we wanted to know how students
do when they are taught with a particular reading or math curriculum, we could obtain
test scores before and after they are exposed to the new mode of instruction to
see how much they learned. But, to determine if this intervention **caused**
the observed student outcomes we need to know how these **same** students
would have done had they **not** received the treatment.^{3} Of course, we cannot observe the same individuals in two
places at the same time. Consequently, the RCT creates equivalent groups by **randomly
assigning** eligible study participants either to a *treatment group*,
which receives the intervention under consideration, or to a *control group*,
which does not receive the particular treatment but often continues with "business
as usual," e.g., the mode of instruction that would be used in the absence of a
new math curriculum.^{4} ^{5} Because of the hierarchical way in which schools are organized,
most education RCTs randomize groups of students—entire schools or classrooms—rather
than individual children to study conditions. In these GRTs the treatment is typically
delivered at the group or cluster level but the primary research interest is the
impact of the selected treatment on student outcomes, although it is also not uncommon
to look for intermediate impacts on teachers or schools.

The advantage of the RCT is that if random assignment is properly implemented with a sufficient sample size, treatment group members will not differ in any systematic or unmeasured way from control group members except through their access to the intervention being studied (the groups are equivalent both on observable and unobservable characteristics). It is this elimination of confounding factors that allows us to make unbiased causal statements about the effect of a particular educational program or intervention by contrasting outcomes between the two groups.

However, for an RCT to produce unbiased impact estimates, the treatment and control
groups must be equivalent in their composition (in expectation) not just at the
point of randomization (referred to as the "baseline" or pretest point), but also
at the point where follow-up or outcome data are collected. Missing outcome data
are a problem for two reasons: (1) the loss of sample members can reduce the power
to detect statistically significant differences, and (2) the introduction of non-random
differences between the treatment and control groups can lead to bias in the estimate
of the intervention's effect. The seriousness of the potential bias is related to
the overall magnitude of the missing data rate, and the extent to which the likelihood
of missing data differs between the treatment and control groups. For example, according
to the *What Works Clearinghouse*^{6}
the bias associated with an overall attrition rate of ten percent and a differential
treatment-control group difference in attrition rates of five percent can be equal
to the bias associated with an overall attrition rate of 30 percent and a differential
attrition rate of just two percent.

Therefore, in a perfect world, the impact analysis conducted for an RCT in education would include outcomes for all eligible study participants defined at the time of randomization. However, this ideal is rarely ever attained. For example, individual student test scores can be completely missing because of absenteeism, school transfer, or parental refusal for testing. In addition, a particular piece of information can be missing because respondents refuse to answer a certain test item or survey question, are unable to provide the requested information, inadvertently skip a question or test item, or provide an unintelligible answer. In an education RCT, we have to also concern ourselves with missing data at the level of entire schools or classrooms if randomly assigned schools or classrooms opt either out of the study completely or do not allow the collection of any outcome data.

As demonstrated by Rubin (1976, 1987), the process through which missing data arise
can have important analytical implications. In its most innocuous form—a category
that Rubin calls *Missing Completely at Random (MCAR)*—the mechanism
that generates missing data is a truly random process unrelated to any measured
or unmeasured characteristic of the study participants. A second category—*Missing
at Random (MAR)*—is one in which missingness is random conditional
on the observed characteristics of the study sample. For example, the missing data
would be MAR if missingness on the post-test score were related to gender, but conditional
on gender—that is, among boys or among girls—the probability of missing
data is the same for all students. Typically, if one can reasonably assume that
missing data arise under either the conditions of MCAR or MAR the missing data problem
can be considered "ignorable," i.e., the factors that cause missingness are unrelated,
or weakly related, to the estimated intervention effect. In some situations, however,
one cannot reasonably assume such ignorability–a category that Rubin calls *Not Missing
at Random (NMAR)*.^{7}
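To make the three categories concrete, the following sketch (Python with NumPy; the variables and all numeric values are invented for illustration, not taken from the report's simulations) generates a post-test score and then deletes it under each of the three mechanisms:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical student sample: a gender indicator and a post-test score
# that is higher on average for girls (values are illustrative only).
girl = rng.integers(0, 2, size=n)
posttest = 50.0 + 5.0 * girl + rng.normal(0.0, 10.0, size=n)

# MCAR: every student has the same 20% chance of a missing post-test.
miss_mcar = rng.random(n) < 0.20

# MAR: missingness depends only on the observed covariate (gender):
# 30% for boys, 10% for girls, so it is random *conditional on* gender.
miss_mar = rng.random(n) < np.where(girl == 1, 0.10, 0.30)

# NMAR: missingness depends on the unobserved post-test score itself,
# e.g., low scorers are more likely to be absent on test day.
miss_nmar = rng.random(n) < np.where(posttest < 50.0, 0.30, 0.10)

mean_full = posttest.mean()
mean_mcar = posttest[~miss_mcar].mean()
mean_nmar = posttest[~miss_nmar].mean()
```

In this toy example the mean of the scores observed under MCAR is an unbiased estimate of the full-sample mean, while under NMAR the observed mean overstates it because low scorers go missing more often.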

Within the context of an RCT, if the missing data mechanism differs between the treatment and control groups, dropping cases with missing data may lead to systematic differences between the experimental groups which can lead to biased impact estimates. Furthermore, even if the missing data mechanism is the same for the treatment and control groups, we may still be concerned if certain types of teachers or students are under- or over-represented in the analysis sample. For example, if the impacts for underrepresented groups are higher than the average impact, then the impact estimates will be biased downward; if the impacts for underrepresented groups are lower than the average impact, the impact estimates will be biased upward.

**C. Missing Data Methods**

As noted above, missing data is a common problem in educational evaluations. For
example, in impact evaluations funded by the National Center for Educational Evaluation
and Regional Assistance (NCEE), student achievement outcomes are often missing for
10-20 percent of the students in the sample (Bernstein,
et al., 2009; Campuzano, et al., 2009;
Constantine, et al., 2009;
Corrin, et al., 2008; Gamse, et al., 2009;
Garet, et al., 2008; and
Wolf, et al., 2009).

Strategies used to address missing data in education RCTs range from simple methods like listwise deletion (e.g., Corrin, et al., 2009), to more sophisticated approaches like multiple imputation (e.g., Campuzano, et al., 2009). In addition, some studies use different approaches to addressing missing covariates and missing outcomes, such as imputing missing covariates but re-weighting complete cases to address missing outcomes (e.g., Wolf, et al., 2009). Despite the prevalence of the missing data challenge, there is no consensus on which methods should be used and the circumstances under which they should be employed. This lack of common standards is not unique to education research, and even areas where experimental research has a long history, like medicine, are still struggling with this issue. For example, guidance from the Food and Drug Administration (FDA) on the issue of missing data in clinical trials indicates that "A variety of statistical strategies have been proposed in the literature…(but) no single method is generally accepted as preferred" (FDA, 2006, p.29).

The selection of the methods that are the focus of this report was based on a review of several recent articles by statistical experts seeking to provide practical guidance to applied researchers (Graham, 2009; Schafer & Graham, 2002; Allison, 2002; and Peugh & Enders, 2004). Specifically, this report examines the following analysis strategies for dealing with the two types of missing data that are of primary importance when analyzing data from an educational RCT (missing outcome or post-test data and missing baseline or pretest data):

**Appropriate for Missing Pretest Data Only:**

- **Dummy Variable Adjustment**—setting missing cases to a constant and adding "missing data flags" to the impact analysis model.

**Appropriate for Missing Post-test Data Only:**

- **Weighting**—re-balancing the analysis sample to account for the loss of study participants.
- **Fully-Specified Regression Models**—adding to the impact analysis model terms that interact the covariates with the treatment indicator.

**Appropriate for Both Types of Missing Data:**

- **Imputation Methods**—filling in missing values using one of four methods: single mean imputation, single non-stochastic regression imputation, single stochastic regression imputation, and multiple stochastic regression imputation.
- **Maximum Likelihood—EM Algorithm with Multiple Imputation**—a statistical estimation method that tries to find the population parameters that are most likely to have produced a particular data sample, using all of the available observations including those with missing data.
- **Selection Modeling and Pattern Mixture Modeling**—two attempts to deal with the NMAR situation by statistically modeling the missing data mechanism.

Our recommendations among these methods are based on the following criteria:

- In general, we recommend avoiding methods for which the simulations indicated a bias, in either the magnitude of the estimated impact or its associated standard error, that exceeded 0.05 standard deviations of the outcome measure (a standard developed on the basis of the current WWC guidance on simple attrition).
- The recommendations are based on the simulation results in which data were missing for 40 percent of students or schools. This is because the alternative set of simulation results, in which data were missing for five percent of either students or schools, showed that all of the tested methods produced results that fell within the WWC-based standard. We recognize that many studies in education will have lower missing data rates than 40 percent. However, our recommendations are designed to be conservative—avoiding methods that produce a large amount of bias when the missing data rate is 40 percent will reduce the likelihood that an evaluation will suffer from bias of this magnitude if the missing data rate is less than 40 percent.
- We provide recommendations separately for missing pretest scores (covariates)^{9} and for missing post-test scores (outcomes). **For missing pretests,** recommended methods must have produced bias below the established thresholds for both impacts and standard errors. Bias in either estimate can lead to biased t-statistics and invalid statistical inference. Therefore, we only recommend methods that produce estimates with low bias for both impacts and standard errors.

In addition, we only recommend methods that produce estimates with low bias *in all three scenarios,* i.e., MCAR, MAR, and NMAR. Because each scenario reflects a different missing data mechanism, and the missing data mechanism is never known in actual studies, methods that produced estimates with low bias in all three scenarios can be considered "safer choices" than methods that produced estimates with low bias in some but not all of the scenarios. **For missing post-test scores,** none of the methods produced impact estimates with bias of less than 0.05 when missing data are NMAR. Therefore, requiring methods to produce estimates with low bias in all three scenarios would have left us with no methods to recommend to analysts facing missing outcome data in their studies. As a consequence, for missing post-test scores, we recommend methods that produce estimates meeting our performance standard when missing data are both MCAR and MAR, recognizing that even recommended methods may produce higher levels of bias under the NMAR condition.

In the discussions that follow, we intentionally include methods that are commonly criticized in the literature—listwise deletion and simple mean value imputation—for two reasons. First, the use of those methods is widespread in education. For example, a review of 545 published education studies by Peugh & Enders (2004) showed a nearly exclusive reliance on deletion as a way to deal with missing data. Second, because we are focusing on RCTs, we wanted to understand how different missing data strategies performed within this unique context, including commonly used but criticized methods.

**D. Guidance to Researchers**

*Basis for the recommendations*

Our recommendations for dealing with missing data in Group Randomized Trials in
education are based on the results of extensive statistical simulations of a typical
educational RCT in which schools are randomized to treatment conditions. As discussed
in Chapter 4, selected missing data methods were examined under conditions that
varied on three dimensions: (1) the amount of missing data, relatively low (5%
missing) vs. relatively high (40% missing); (2) the level at which data are missing—at
the level of whole schools (the assumed unit of randomization) or for students within
schools; and, (3) the underlying missing data mechanisms discussed above (i.e. MCAR,
MAR and NMAR).

The performance of the selected missing data methods was assessed on the basis of
the bias that was found in both the estimated impact and the associated estimated
standard error, using a set of standards that were developed from guidance currently in
use by the U.S. Department of Education's *What Works Clearinghouse* (see
Chapter 4 and Appendix E).

The recommendations provided below are based on the criteria outlined above.

*Recommendations*

*Missing Pretest Scores Or Other Covariates*

When pretest scores or other covariates are **missing for students within schools**
in studies that randomize schools the simulation results lead us to recommend the
use of the following missing data methods:

- Dummy variable adjustment,
- Single stochastic regression imputation,
- Multiple stochastic regression imputation (i.e., "multiple imputation"), and
- Maximum Likelihood—EM algorithm with multiple imputation.

In this context, we would not recommend the use of three methods that produced impact estimates with bias that exceeded 0.05 in one or more of our simulations: case deletion, mean value imputation, and single non-stochastic regression imputation.

Alternatively, when data on baseline variables are **missing for entire schools**, the
simulation results lead us to recommend the use of the following methods:^{10}

- Case deletion,
- Dummy variable adjustment,
- Mean value imputation,
- Multiple stochastic regression imputation, and
- Maximum Likelihood—EM algorithm with multiple imputation.

We would **not** recommend the use of two methods that produced standard
error estimates with bias in at least one of our simulations that exceeded the WWC-based
threshold: single non-stochastic regression imputation and single stochastic regression
imputation.

Across the two scenarios—i.e., situations when pretest or covariate data are missing either for students within schools or for entire schools—three methods were consistently recommended and are therefore likely to be the best choices:

- Dummy variable adjustment,
- Multiple stochastic regression imputation, and
- Maximum Likelihood-EM algorithm with multiple imputation.

It is important to stress that these recommendations are specific to situations in which the intent is to make inferences about the coefficient on the treatment indicator in a group randomized trial. If, for example, an analyst wanted to make inference about the relationship between pretest and post-test scores, and there were missing values on the pretest scores, we would not recommend use of the dummy variable approach. With this method, the estimate of the coefficient on the pretest score is likely to be biased, as has been described in previous literature. But when the interest is on the estimate of the treatment effect, the dummy variable approach yields bias in the coefficient of interest—the estimated treatment effect—that falls within the acceptable range as we defined it for these simulations, and is similar in magnitude to the biases obtained from the more sophisticated methods.
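As a rough illustration of the mechanics (a minimal sketch with made-up data, not the report's simulation code), the following applies the dummy variable adjustment to a hypothetical student-level RCT with a known impact of 2 points and roughly 40 percent of pretests missing:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Invented data-generating process with a true impact of 2.0 points.
treat = rng.integers(0, 2, size=n)
pretest = rng.normal(50.0, 10.0, size=n)
posttest = 10.0 + 0.8 * pretest + 2.0 * treat + rng.normal(0.0, 5.0, size=n)

# Pretest missing for roughly 40% of students (MCAR here for simplicity).
missing = rng.random(n) < 0.40

# Dummy variable adjustment: set missing pretests to a constant (the
# observed mean is a common choice) and add a missing-data flag to the
# impact model alongside the treatment indicator.
pre_filled = np.where(missing, pretest[~missing].mean(), pretest)
flag = missing.astype(float)

X = np.column_stack([np.ones(n), treat, pre_filled, flag])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
impact_est = beta[1]  # coefficient on the treatment indicator
```

Because assignment is random and, in this toy setup, missingness is unrelated to treatment status, the coefficient on the treatment indicator lands close to the true impact even though the coefficient on the filled-in pretest is not trustworthy, which is exactly the distinction drawn above.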

*Missing Post-Test Scores Or Other Outcome Variables*

When data on outcome variables are missing for students within schools in studies
that randomize schools, the simulation results lead us to recommend the use of the
following methods:

- Case deletion,
- Single non-stochastic regression imputation,
- Single stochastic regression imputation,
- Multiple stochastic regression imputation,
- Maximum Likelihood-EM algorithm with multiple imputation,
- "Simple" weighting approach using the inverse of observed response rates,
- "Sophisticated" weighting approach that involved modeling non-response to create weights, and
- Fully-specified regression models with treatment-covariate interactions.

We would **not** recommend using mean value imputation because it was
the only method that produced impact estimates with bias that exceeded 0.05 under
the MAR scenario.
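The "simple" weighting approach listed above can be sketched as follows (Python with NumPy; the response rates, impact sizes, and covariate are invented for illustration). Respondents are weighted by the inverse of the observed response rate in their covariate-by-treatment cell:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

# Hypothetical RCT in which the impact is larger for boys (4 points)
# than for girls (2 points), so the true average impact is 3.0.
treat = rng.integers(0, 2, size=n)
girl = rng.integers(0, 2, size=n)
posttest = (50.0 + 4.0 * girl + (2.0 + 2.0 * (1 - girl)) * treat
            + rng.normal(0.0, 10.0, size=n))

# Post-tests are observed far less often for boys (MAR given gender).
respond = rng.random(n) < np.where(girl == 1, 0.95, 0.40)

# Weight each respondent by the inverse of the observed response rate
# in cells defined by the covariate crossed with treatment status.
weight = np.zeros(n)
for t in (0, 1):
    for g in (0, 1):
        cell = (treat == t) & (girl == g)
        weight[cell] = 1.0 / respond[cell].mean()

def wmean(y, w):
    return np.sum(w * y) / np.sum(w)

r = respond  # analysis sample = respondents only
impact_wt = (wmean(posttest[r & (treat == 1)], weight[r & (treat == 1)])
             - wmean(posttest[r & (treat == 0)], weight[r & (treat == 0)]))

# For comparison: the unweighted complete-case contrast over-represents
# girls, whose impact is smaller, and so understates the average impact.
impact_cc = (posttest[r & (treat == 1)].mean()
             - posttest[r & (treat == 0)].mean())
```

The re-weighted contrast recovers the full-sample average impact, while the unweighted complete-case contrast is pulled toward the impact for the over-represented group, illustrating the bias mechanism described in Section B.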

When data on dependent variables are **missing for entire schools**,
the simulation results lead us to recommend the use of the following methods:

- Case deletion,
- Multiple stochastic regression imputation,
- Maximum Likelihood-EM algorithm with multiple imputation,
- Sophisticated weighting approach, and
- Fully specified regression model with treatment-covariate interactions.

We would **not** recommend the use of the three methods that produced
standard error estimates with bias that exceeded the WWC-based threshold: mean value
imputation, single non-stochastic regression imputation, and single stochastic regression
imputation.

Across the two scenarios—i.e., situations when post-test or outcome data are missing either for students within schools or for entire schools—five methods were consistently recommended and are therefore likely to be the best choices:

- Case deletion,
- Multiple stochastic regression imputation,
- Maximum Likelihood-EM algorithm with multiple imputation,
- Sophisticated weighting approach, and
- Fully-specified regression models.

In addition, we recommend that if post-test scores are missing for a high fraction
of students or schools (e.g., 40%), analysts should control for pretest scores in
the impact model if possible. In our simulations, controlling for pretest scores
by using them as regression covariates reduced the bias in the impact estimate by
approximately 50 percent, and this finding was robust across different scenarios
and different methods.^{11}
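A minimal sketch of a fully-specified regression model with treatment-covariate interactions may also be useful here (Python with NumPy; one invented binary covariate with heterogeneous impacts, not the report's specification). The key step is centering the covariate over the full randomized sample before interacting it with the treatment indicator:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Invented setup: impact of 4 for boys, 2 for girls (average 3.0), and
# post-tests observed much less often for boys (MAR given gender).
treat = rng.integers(0, 2, size=n)
girl = rng.integers(0, 2, size=n)
posttest = (50.0 + 4.0 * girl + (2.0 + 2.0 * (1 - girl)) * treat
            + rng.normal(0.0, 10.0, size=n))
respond = rng.random(n) < np.where(girl == 1, 0.95, 0.40)

# Fully-specified model: center the covariate at its mean over the
# *full* randomized sample, then add a treatment-by-covariate
# interaction. Fit on respondents only; the coefficient on `treat`
# then estimates the average impact for the full randomized sample
# rather than for the (unrepresentative) respondent sample.
g_c = girl - girl.mean()  # centered over all randomized students
r = respond
X = np.column_stack([np.ones(r.sum()), treat[r], g_c[r], treat[r] * g_c[r]])
beta, *_ = np.linalg.lstsq(X, posttest[r], rcond=None)
impact_full = beta[1]
```

Centering at the full-sample covariate mean is what makes the treatment coefficient an estimate of the average impact for everyone randomized, not just for those with outcome data.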

As a final note, the recommendations provided above indicate that some methods that are easy to implement performed similarly to more sophisticated methods. In particular, where pretest scores were missing for either students or entire schools, the dummy variable approach performed similarly to the more sophisticated approaches and was among our recommended methods. And when post-test scores were missing for either students or entire schools, case deletion was among our recommended approaches. Consequently, we suggest that analysts take the ease with which missing data can be handled, and the transparency of the methods to a policy audience, into consideration when making a final choice of an appropriate method.
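Since multiple stochastic regression imputation appears on both recommended lists, a deliberately stripped-down sketch of the idea may help (Python with NumPy; invented data, and, unlike fully "proper" multiple imputation, the regression parameters are not redrawn from their posterior across imputations). Missing pretests are filled in several times with prediction-plus-noise draws, and the resulting estimates are combined with Rubin's rules:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Invented RCT data with a true impact of 2.0 points and ~40% of
# pretest scores missing (MCAR here for simplicity).
treat = rng.integers(0, 2, size=n)
pretest = rng.normal(50.0, 10.0, size=n)
posttest = 10.0 + 0.8 * pretest + 2.0 * treat + rng.normal(0.0, 5.0, size=n)
miss = rng.random(n) < 0.40
obs = ~miss

M = 20  # number of imputed data sets
impacts, variances = [], []
for _ in range(M):
    # Stochastic regression imputation: regress the pretest on the
    # fully observed variables (post-test and treatment), then fill in
    # each missing pretest with a prediction plus a residual draw.
    Xo = np.column_stack([np.ones(obs.sum()), posttest[obs], treat[obs]])
    b, rss_o, *_ = np.linalg.lstsq(Xo, pretest[obs], rcond=None)
    sigma = np.sqrt(rss_o[0] / (obs.sum() - 3))
    Xm = np.column_stack([np.ones(miss.sum()), posttest[miss], treat[miss]])
    pre_m = pretest.copy()
    pre_m[miss] = Xm @ b + rng.normal(0.0, sigma, size=miss.sum())

    # Impact model estimated on the completed data set.
    X = np.column_stack([np.ones(n), treat, pre_m])
    beta, rss, *_ = np.linalg.lstsq(X, posttest, rcond=None)
    s2 = rss[0] / (n - 3)
    impacts.append(beta[1])
    variances.append(s2 * np.linalg.inv(X.T @ X)[1, 1])

# Rubin's rules: combine the M point estimates and their variances.
impact_mi = float(np.mean(impacts))
within = float(np.mean(variances))
between = float(np.var(impacts, ddof=1))
var_mi = within + (1.0 + 1.0 / M) * between
```

Note that the combined variance adds a between-imputation component to the average within-imputation variance, which is how multiple imputation, unlike single imputation, reflects the extra uncertainty introduced by the missing data.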

*Other Suggestions For Researchers*

In addition to the recommendations that we derive from the simulation results presented
in this report, we also recommend that all researchers adhere to the following general
analysis procedures when conducting any education RCT:

**What to do during the analysis planning stage?** Researchers should carefully describe, and commit to, a plan for dealing with missing data before looking at preliminary impact estimates or any outcome data files that include the treatment indicator variable. Committing to a design, and then sticking to it, is fundamental to scientific research in any substantive area. This is best accomplished by publishing the missing data plan prior to collecting the outcome data. If the original plan then fails to anticipate any of the missing data problems observed in the data, the plan should be updated and revised before any impact estimates are produced.

Researchers should also consider conducting sensitivity analysis to allow readers to assess how different the estimated impacts might be under different assumptions or decisions about how to handle missing data (see Chapter 3 for a discussion of one approach involving placing "best and worst case" bounds around the estimated impacts). These planned sensitivity tests should be specified ahead of time in the analysis plan.

**What information should the impact report provide about missing data?** In their impact report, researchers should report missing data rates by variable, explain the reasons for missing data (to the extent known), and provide a detailed description of how missing data were handled in the analysis, consistent with the original plan.^{12} Impact reports should also provide key descriptive statistics for the study sample, including: (1) differences between the treatment and control group on baseline characteristics both at the point of random assignment and for the impact analysis sample (excluding, of course, any missing data imputations); and, (2) differences in baseline characteristics between treatment group respondents and non-respondents (i.e., those with and without outcome data), and similarly between control group respondents and non-respondents.

*Final Caveats*

Readers are cautioned to keep in mind that these simulation results are specific
to a particular type of evaluation—an RCT in which schools are randomized to experimental
conditions. Whether the key findings from these simulations would apply in RCTs
that randomize students instead of schools is an open question that we have not
addressed in this report. In addition, it is not clear whether the findings from
our simulations would be sensitive to changes in the key parameter values that we
set in specifying the data generating process and the missing data mechanisms. Finally,
we could not test all possible methods to address missing data. These limitations
notwithstanding, we believe these simulations yield important results that can help
inform decisions that researchers need to make when they face missing data in conducting
education RCTs.

Finally, despite all of the insights gained from the simulations, we cannot propose a fully specified and empirically justified decision rule about which methods to use and when. Too often, the best method depends on the purpose and design of the particular study and the underlying missing data mechanism that cannot, to the best of our knowledge, be uncovered from the data.