NCEE 2009-0049, October 2009

## 1. Overview and Guidance

A. Introduction
Most statistics textbooks provide lengthy discussions of the theory of probability, descriptive statistics, hypothesis testing, and a range of simple to more complex statistical methods. To illustrate these discussions, the authors often present examples with real or fictional data—tidy tables of observations and variables with values for each cell. Although this may be entirely appropriate to illustrate statistical methods, anyone who does "real world" research knows that data are rarely, if ever, so complete. Some study participants may be unavailable for data collection, refuse to provide data, or be asked a question which is not applicable to their circumstances. Whatever the mechanism that causes the data to be missing, it is a common problem in almost all research studies.

This report is designed to provide practical guidance on how to address the problem of missing data in the analysis of data from Randomized Controlled Trials (RCTs) of educational interventions, with a particular focus on the common educational situation in which groups of students such as entire classrooms or schools are randomized (called Group Randomized Trials, GRTs). The need for such guidance is of growing importance as the number of educational RCTs has increased in recent years. For example, the ten Regional Educational Laboratories (RELs) sponsored by the Institute of Education Sciences (IES) of the U.S. Department of Education are currently conducting 25 RCTs to measure the effectiveness of different educational interventions,1 and IES has sponsored 23 impact evaluations that randomized students, schools, or teachers since it was established in 2002.2

This report is divided into four chapters. Following a brief overview of the missing data problem, this first chapter provides our overall guidance for educational researchers based on the results of extensive data simulations that were done to assess the relative performance of selected missing data strategies within the context of the types of RCTs that have been conducted in education. Chapter 2 sets the stage for a discussion of specific missing data strategies by providing a brief overview of the design of RCTs in education, the types of data used in impact analysis, how these data can be missing, and the analytical implications of missing data. Chapter 3 describes a selection of methods available for addressing missing data, and Chapter 4 describes the simulation methodology and the statistical results that support the recommendations presented in this chapter. Appendices provide additional details on the simulation methods and the statistical results.

B. Missing Data and Randomized Trials
The purpose of a randomized controlled trial (RCT) is to allow researchers to draw causal conclusions about the effect, or "impact," of a particular policy-relevant intervention (U.S. Department of Education, 2009). For example, if we wanted to know how students do when they are taught with a particular reading or math curriculum, we could obtain test scores before and after they are exposed to the new mode of instruction to see how much they learned. But, to determine if this intervention caused the observed student outcomes we need to know how these same students would have done had they not received the treatment.3 Of course, we cannot observe the same individuals in two places at the same time. Consequently, the RCT creates equivalent groups by randomly assigning eligible study participants either to a treatment group that receives the intervention under consideration, or to a control group that does not receive the particular treatment but often continues with "business as usual," e.g., the mode of instruction that would be used in the absence of a new math curriculum.4,5 Because of the hierarchical way in which schools are organized, most education RCTs randomize groups of students—entire schools or classrooms—rather than individual children to study conditions. In these GRTs the treatment is typically delivered at the group or cluster level but the primary research interest is the impact of the selected treatment on student outcomes, although it is also not uncommon to look for intermediate impacts on teachers or schools.

The advantage of the RCT is that if random assignment is properly implemented with a sufficient sample size, treatment group members will not differ in any systematic or unmeasured way from control group members except through their access to the intervention being studied (the groups are equivalent both on observable and unobservable characteristics). It is this elimination of confounding factors that allows us to make unbiased causal statements about the effect of a particular educational program or intervention by contrasting outcomes between the two groups.

However, for an RCT to produce unbiased impact estimates, the treatment and control groups must be equivalent in their composition (in expectation) not just at the point of randomization (referred to as the "baseline" or pretest point), but also at the point where follow-up or outcome data are collected. Missing outcome data are a problem for two reasons: (1) the loss of sample members can reduce the power to detect statistically significant differences, and (2) the introduction of non-random differences between the treatment and control groups can lead to bias in the estimate of the intervention's effect. The seriousness of the potential bias is related to the overall magnitude of the missing data rate, and the extent to which the likelihood of missing data differs between the treatment and control groups. For example, according to the What Works Clearinghouse6 the bias associated with an overall attrition rate of ten percent and a differential treatment-control group difference in attrition rates of five percent can be equal to the bias associated with an overall attrition rate of 30 percent and a differential attrition rate of just two percent.

Therefore, in a perfect world, the impact analysis conducted for an RCT in education would include outcomes for all eligible study participants defined at the time of randomization. However, this ideal is rarely, if ever, attained. For example, individual student test scores can be completely missing because of absenteeism, school transfer, or parental refusal for testing. In addition, a particular piece of information can be missing because respondents refuse to answer a certain test item or survey question, are unable to provide the requested information, inadvertently skip a question or test item, or provide an unintelligible answer. In an education RCT, we also have to concern ourselves with missing data at the level of entire schools or classrooms if randomly assigned schools or classrooms either opt out of the study completely or do not allow the collection of any outcome data.

As demonstrated by Rubin (1976, 1987), the process through which missing data arise can have important analytical implications. In its most innocuous form—a category that Rubin calls Missing Completely at Random (MCAR)—the mechanism that generates missing data is a truly random process unrelated to any measured or unmeasured characteristic of the study participants. A second category—Missing at Random (MAR)—is one in which missingness is random conditional on the observed characteristics of the study sample. For example, the missing data would be MAR if missingness on the post-test score were related to gender, but conditional on gender—that is, among boys or among girls—the probability of missing data is the same for all students. Typically, if one can reasonably assume that missing data arise under either the conditions of MCAR or MAR, the missing data problem can be considered "ignorable," i.e., the factors that cause missingness are unrelated, or weakly related, to the estimated intervention effect. In some situations, however, one cannot reasonably assume such ignorability—a category that Rubin calls Not Missing at Random (NMAR).7
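To make the three categories concrete, the following sketch (hypothetical data, not the report's simulation code) generates a post-test score that differs by gender and then deletes values under each mechanism. Only under MCAR does the observed mean stay on target; under MAR it drifts, but conditioning on the observed covariate (gender) restores the correct mean within groups, which is why MAR is "ignorable" given proper conditioning:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: gender and a post-test score that differs by gender.
girl = rng.integers(0, 2, n)                  # 0 = boy, 1 = girl
post = 50 + 5 * girl + rng.normal(0, 10, n)   # girls average 5 points higher

# MCAR: a coin flip, unrelated to any characteristic of the student.
miss_mcar = rng.random(n) < 0.30
# MAR: depends only on the observed covariate (girls missing more often).
miss_mar = rng.random(n) < np.where(girl == 1, 0.45, 0.15)
# NMAR: depends on the unobserved post-test score itself.
miss_nmar = rng.random(n) < np.where(post < 50, 0.45, 0.15)

print("true mean:", round(post.mean(), 2))
for label, miss in [("MCAR", miss_mcar), ("MAR", miss_mar), ("NMAR", miss_nmar)]:
    print(label, "observed mean:", round(post[~miss].mean(), 2))

# Under MAR, conditioning on gender recovers the correct mean within groups.
print("girls, full vs. MAR-observed:",
      round(post[girl == 1].mean(), 2),
      round(post[(girl == 1) & ~miss_mar].mean(), 2))
```

The NMAR case is the troublesome one: because missingness depends on the unobserved score itself, no observed covariate can restore the full-sample mean.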

Within the context of an RCT, if the missing data mechanism differs between the treatment and control groups, dropping cases with missing data may lead to systematic differences between the experimental groups which can lead to biased impact estimates. Furthermore, even if the missing data mechanism is the same for the treatment and control groups, we may still be concerned if certain types of teachers or students are under- or over-represented in the analysis sample. For example, if the impacts for underrepresented groups are higher than the average impact, then the impact estimates will be biased downward; if the impacts for underrepresented groups are lower than the average impact, the impact estimates will be biased upward.
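A small hypothetical calculation illustrates the direction of this bias. Suppose two equally sized subgroups have true impacts of 2 and 6 points, so the true average impact is 4; if the higher-impact subgroup is under-represented in the analysis sample, the estimated average impact falls below 4 even when each subgroup's impact is estimated correctly:

```python
# Two hypothetical subgroups of equal size in the full sample.
share_full = {"group_a": 0.5, "group_b": 0.5}
impact = {"group_a": 2.0, "group_b": 6.0}
true_avg = sum(share_full[g] * impact[g] for g in impact)

# If group_b (the higher-impact group) is under-represented in the
# analysis sample, the estimated average impact is biased downward.
share_analysis = {"group_a": 0.7, "group_b": 0.3}
est_avg = sum(share_analysis[g] * impact[g] for g in impact)

print("true average impact:", true_avg)       # 4.0
print("estimated average impact:", est_avg)   # 3.2
```

Reversing the shares would over-represent the high-impact group and bias the estimate upward, matching the second case described above.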

C. Missing Data Methods
As noted above, missing data is a common problem in educational evaluations. For example, in impact evaluations funded by the National Center for Education Evaluation and Regional Assistance (NCEE), student achievement outcomes are often missing for 10-20 percent of the students in the sample (Bernstein, et al., 2009; Campuzano, et al., 2009; Constantine, et al., 2009; Corrin, et al., 2008; Gamse, et al., 2009; Garet, et al., 2008; and Wolf, et al., 2009).

Strategies used to address missing data in education RCTs range from simple methods like listwise deletion (e.g., Corrin, et al., 2009), to more sophisticated approaches like multiple imputation (e.g., Campuzano, et al., 2009). In addition, some studies use different approaches to addressing missing covariates and missing outcomes, such as imputing missing covariates but re-weighting complete cases to address missing outcomes (e.g., Wolf, et al., 2009). Despite the prevalence of the missing data challenge, there is no consensus on which methods should be used and the circumstances under which they should be employed. This lack of common standards is not unique to education research, and even areas where experimental research has a long history, like medicine, are still struggling with this issue. For example, guidance from the Food and Drug Administration (FDA) on the issue of missing data in clinical trials indicates that "A variety of statistical strategies have been proposed in the literature…(but) no single method is generally accepted as preferred" (FDA, 2006, p.29).

The selection of the methods that are the focus of this report was based on a review of several recent articles by statistical experts seeking to provide practical guidance to applied researchers (Graham, 2009; Schafer & Graham, 2002; Allison, 2002; and Peugh & Enders, 2004). Specifically, this report examines the following analysis strategies for dealing with the two types of missing data that are of primary importance when analyzing data from an educational RCT: missing outcome or post-test data and missing baseline or pretest data.

• Appropriate for Missing Pretest Data Only:
• Dummy Variable Adjustment—setting missing cases to a constant and adding "missing data flags" to the impact analysis model.
• Appropriate for Missing Post-test Data Only:
• Weighting—re-balancing the analysis sample to account for the loss of study participants.
• Fully-Specified Regression Models—adding to the impact analysis model terms that interact the covariates with the treatment indicator.
• Appropriate for Both Types of Missing Data:
• Imputation Methods—filling in missing values using one of four methods: single mean imputation, single non-stochastic regression imputation, single stochastic regression imputation, and multiple stochastic regression imputation.
• Maximum Likelihood—EM Algorithm with Multiple Imputation—a statistical estimation method that tries to find the population parameters that are most likely to have produced a particular data sample, using all of the available observations including those with missing data.
• Selection Modeling and Pattern Mixture Modeling—two attempts to deal with the NMAR situation by statistically modeling the missing data mechanism.
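To make the imputation family concrete, the sketch below (hypothetical data and parameter values, not the report's simulation code) implements single stochastic regression imputation for a partially missing pretest: fit a regression of the pretest on observed covariates among complete cases, then fill each missing value with its prediction plus a random residual draw. Multiple stochastic regression imputation repeats this process several times with fresh draws and combines results across the completed data sets:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Hypothetical data: an always-observed covariate and a pretest that is
# missing for roughly 30% of students.
covariate = rng.normal(0, 1, n)
pretest = 50 + 4 * covariate + rng.normal(0, 8, n)
missing = rng.random(n) < 0.3

# Fit the imputation regression on complete cases only.
X_obs = np.column_stack([np.ones((~missing).sum()), covariate[~missing]])
beta, *_ = np.linalg.lstsq(X_obs, pretest[~missing], rcond=None)
resid_sd = np.std(pretest[~missing] - X_obs @ beta)

# Fill each missing value with its prediction plus a random residual draw.
X_mis = np.column_stack([np.ones(missing.sum()), covariate[missing]])
imputed = pretest.copy()
imputed[missing] = X_mis @ beta + rng.normal(0, resid_sd, missing.sum())

# The stochastic residual draw preserves the spread of the pretest instead
# of shrinking imputed values toward the regression line.
print("sd of true pretest:", round(pretest.std(), 2),
      " sd after imputation:", round(imputed.std(), 2))
```

Dropping the residual draw in the last imputation step would yield single non-stochastic regression imputation, which understates the variance of the imputed variable.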

In the discussions that follow, we intentionally include methods that are commonly criticized in the literature—listwise deletion and simple mean value imputation—for two reasons. First, the use of those methods is widespread in education. For example, a review of 545 published education studies by Peugh & Enders (2004) showed a nearly exclusive reliance on deletion as a way to deal with missing data. Second, because we are focusing on RCTs, we wanted to understand how different missing data strategies performed within this unique context, including commonly used but criticized methods.

D. Guidance to Researchers
Basis for the recommendations
Our recommendations for dealing with missing data in Group Randomized Trials in education are based on the results of extensive statistical simulations of a typical educational RCT in which schools are randomized to treatment conditions. As discussed in Chapter 4, selected missing data methods were examined under conditions that varied on three dimensions: (1) the amount of missing data, relatively low (5% missing) vs. relatively high (40% missing); (2) the level at which data are missing—at the level of whole schools (the assumed unit of randomization) or for students within schools; and, (3) the underlying missing data mechanisms discussed above (i.e., MCAR, MAR, and NMAR).

The performance of the selected missing data methods was assessed on the basis of the bias that was found in both the estimated impact and the associated estimated standard error, using a set of standards that were developed from guidance currently in use by the U.S. Department of Education's What Works Clearinghouse (see Chapter 4 and Appendix E).

The recommendations that are provided below are based on the following criteria:

• In general, we recommend avoiding methods for which the simulations indicated a bias, in either the magnitude of the estimated impact or its associated standard error, that exceeded 0.05 standard deviations of the outcome measure (a standard developed on the basis of the current WWC guidance on simple attrition).
• The recommendations are based on the simulation results in which data were missing for 40 percent of students or schools. This is because the alternative set of simulation results in which data were missing for five percent of either students or schools showed that all of the tested methods produced results that fell within the WWC-based standard. We recognize that many studies in education will have lower missing data rates than 40 percent. However, our recommendations are designed to be conservative—avoiding methods that produce a large amount of bias when the missing data rate is 40 percent will reduce the likelihood that an evaluation will suffer from bias of this magnitude if the missing data rate is less than 40 percent.
• We provide recommendations separately for missing pretest scores9 (covariates) and for missing post-test scores (outcomes):
• For missing pretests, recommended methods must have produced bias below the established thresholds for both impacts and standard errors. Bias in either estimate can lead to biased t-statistics and invalid statistical inference. Therefore, we only recommend methods that produce estimates with low bias for both impacts and standard errors.

In addition, we only recommend methods that produce estimates with low bias in all three scenarios, i.e., MCAR, MAR and NMAR. Because each scenario reflects a different missing data mechanism, and the missing data mechanism is never known in actual studies, methods that produced estimates with low bias in all three scenarios can be considered "safer choices" than methods that produced estimates with low bias in some but not all of the scenarios.
• For missing post-test scores, none of the methods produced impact estimates with bias of less than 0.05 when missing data are NMAR. Therefore, requiring methods to produce estimates with low bias in all three scenarios would have left us with no methods to recommend to analysts facing missing outcome data in their studies. As a consequence, for missing post-test scores, we recommend methods that produce estimates meeting our performance standard under both the MCAR and MAR scenarios, recognizing that even recommended methods may produce higher levels of bias under the NMAR condition.

Recommendations
Missing Pretest Scores Or Other Covariates
When pretest scores or other covariates are missing for students within schools in studies that randomize schools, the simulation results lead us to recommend the use of the following missing data methods:

• Dummy variable adjustment,
• Single stochastic regression imputation,
• Multiple stochastic regression imputation (i.e., "multiple imputation"), and
• Maximum Likelihood—EM algorithm with multiple imputation.

In this context, we would not recommend the use of three methods that produced impact estimates with bias that exceeded 0.05 in one or more of our simulations: case deletion, mean value imputation, and single non-stochastic regression imputation.

Alternatively, when data on baseline variables are missing for entire schools, the simulation results lead us to recommend the use of the following methods:10

• Case deletion,
• Dummy variable adjustment,
• Mean value imputation,
• Multiple stochastic regression imputation, and
• Maximum Likelihood—EM algorithm with multiple imputation.

We would not recommend the use of two methods that produced standard error estimates with bias in at least one of our simulations that exceeded the WWC-based threshold: single non-stochastic regression imputation and single stochastic regression imputation.

Across the two scenarios—i.e., situations when pretest or covariate data are missing either for students within schools or for entire schools—three methods were found to be consistently recommended and are likely to be the best choices:

• Dummy variable adjustment,
• Multiple stochastic regression imputation, and
• Maximum Likelihood—EM algorithm with multiple imputation.

It is important to stress that these recommendations are specific to situations in which the intent is to make inferences about the coefficient on the treatment indicator in a group randomized trial. If, for example, an analyst wanted to make inference about the relationship between pretest and post-test scores, and there were missing values on the pretest scores, we would not recommend use of the dummy variable approach. With this method, the estimate of the coefficient on the pretest score is likely to be biased, as has been described in previous literature. But when the interest is in the estimate of the treatment effect, the dummy variable approach yields bias in the coefficient of interest—the estimated treatment effect—that falls within the acceptable range as we defined it for these simulations, and is similar in magnitude to the biases obtained from the more sophisticated methods.
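A small sketch makes the point concrete (hypothetical data and parameter values, with student-level randomization for simplicity rather than the report's school-level randomization): the dummy variable adjustment fills missing pretests with a constant and adds a missing-data flag, and under the simple missing-completely-at-random setup assumed here the coefficient on the treatment indicator stays close to the true effect of 3 points:

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 2_000, 3.0

# Hypothetical data: a pretest, a randomized treatment indicator, and a
# post-test with a true treatment effect of 3 points.
pretest = rng.normal(50, 10, n)
treat = rng.integers(0, 2, n)
posttest = 10 + 0.8 * pretest + tau * treat + rng.normal(0, 5, n)
missing = rng.random(n) < 0.2   # pretest missing completely at random

# Dummy variable adjustment: constant fill plus a "missing data flag."
pre_filled = np.where(missing, pretest[~missing].mean(), pretest)
flag = missing.astype(float)

# Impact model: posttest ~ 1 + filled pretest + flag + treatment.
X = np.column_stack([np.ones(n), pre_filled, flag, treat])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
print("estimated treatment effect:", round(beta[3], 2))
```

The appeal of the approach is its transparency: no values are modeled or simulated, yet the treatment coefficient remains usable for impact inference.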

Missing Post-Test Scores Or Other Outcome Variables
When data on outcome variables are missing for students within schools in studies that randomize schools, the simulation results lead us to recommend the use of the following methods:

• Case deletion,
• Single non-stochastic regression imputation,
• Single stochastic regression imputation,
• Multiple stochastic regression imputation,
• Maximum Likelihood-EM algorithm with multiple imputation,
• "Simple" weighting approach using the inverse of observed response rates,
• "Sophisticated" weighting approach that involved modeling non-response to create weights, and
• Fully-specified regression models with treatment-covariate interactions.

We would not recommend using mean value imputation because it was the only method that produced impact estimates with bias that exceeded 0.05 under the MAR scenario.
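The "simple" weighting approach listed above can be sketched as follows (hypothetical data; school-level randomization with post-tests missing at random within schools): each respondent is weighted by the inverse of the observed response rate in his or her school, so respondents stand in for nonrespondents from the same school:

```python
import numpy as np

rng = np.random.default_rng(3)
n_schools, per_school = 20, 50
school = np.repeat(np.arange(n_schools), per_school)
# School-level randomization: all students in a school share one assignment.
treat = np.repeat(rng.integers(0, 2, n_schools), per_school)
post = 50.0 + 3.0 * treat + rng.normal(0, 10, school.size)
observed = rng.random(school.size) < 0.7   # ~30% of post-tests missing

# Weight each respondent by the inverse of the school's response rate.
weights = np.zeros(school.size)
for s in range(n_schools):
    in_s = school == s
    weights[in_s & observed] = 1.0 / observed[in_s].mean()

t_obs = observed & (treat == 1)
c_obs = observed & (treat == 0)
impact = (np.average(post[t_obs], weights=weights[t_obs])
          - np.average(post[c_obs], weights=weights[c_obs]))
print("weighted impact estimate:", round(impact, 2))
```

As the report notes, this simple approach cannot be applied when data are missing for entire schools, since a school with no respondents has no one to weight up; the "sophisticated" variant instead models non-response (e.g., with a response-propensity model) to construct the weights.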

When data on dependent variables are missing for entire schools, the simulation results lead us to recommend the use of the following methods:

• Case deletion,
• Multiple stochastic regression imputation,
• Maximum Likelihood-EM algorithm with multiple imputation,
• Sophisticated weighting approach, and
• Fully specified regression model with treatment-covariate interactions.

We would not recommend the use of the three methods that produced standard error estimates with bias that exceeded the WWC-based threshold: mean value imputation, single non-stochastic regression imputation, and single stochastic regression imputation.

Across the two scenarios—i.e., situations when post-test or other outcome data are missing either for students within schools or for entire schools—five methods were found to be consistently recommended and are likely to be the best choices:

• Case deletion,
• Multiple stochastic regression imputation,
• Maximum Likelihood-EM algorithm with multiple imputation,
• Sophisticated weighting approach, and
• Fully-specified regression models.

In addition, we recommend that if post-test scores are missing for a high fraction of students or schools (e.g., 40%), analysts should control for pretest scores in the impact model if possible. In our simulations, controlling for pretest scores by using them as regression covariates reduced the bias in the impact estimate by approximately 50 percent, and this finding was robust across different scenarios and different methods.11
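A sketch along the following lines (hypothetical data and parameter values, with simplified student-level randomization and missingness that depends on the post-test score itself, i.e., NMAR) illustrates why this helps: because the pretest predicts the post-test, conditioning on it absorbs part of the selection, so case deletion with the pretest as a covariate typically shows noticeably less bias than case deletion without it:

```python
import numpy as np

rng = np.random.default_rng(4)
n, tau = 50_000, 3.0

# Hypothetical data: the pretest strongly predicts the post-test.
pre = rng.normal(0, 10, n)
treat = rng.integers(0, 2, n)
post = 0.9 * pre + tau * treat + rng.normal(0, 3, n)

# NMAR missingness: high post-test scorers are far more likely to be missing.
keep = np.where(post > post.mean(), rng.random(n) < 0.15, rng.random(n) < 0.90)

# Case deletion without the pretest: simple difference in observed means.
bias_raw = (post[keep & (treat == 1)].mean()
            - post[keep & (treat == 0)].mean()) - tau

# Case deletion with the pretest as a covariate in the impact model.
X = np.column_stack([np.ones(keep.sum()), pre[keep], treat[keep]])
beta, *_ = np.linalg.lstsq(X, post[keep], rcond=None)
bias_adj = beta[2] - tau

print("bias without pretest:", round(bias_raw, 3),
      " bias with pretest:", round(bias_adj, 3))
```

The exact sizes of the two biases depend on the assumed parameter values; the qualitative pattern, reduced but not eliminated bias when the pretest is controlled, mirrors the simulation finding cited above.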

As a final note, the recommendations provided above indicate that some methods that are easy to implement performed similarly to more sophisticated methods. In particular, where pretest scores were missing for either students or entire schools, the dummy variable approach performed similarly to the more sophisticated approaches and was among our recommended methods. And when post-test scores were missing for either students or entire schools, case deletion was among our recommended approaches. Consequently, we suggest that analysts take the ease with which missing data can be handled, and the transparency of the methods to a policy audience, into consideration when making a final choice of an appropriate method.

Other Suggestions For Researchers
In addition to the recommendations that we derive from the simulation results presented in this report, we also recommend that all researchers adhere to the following general analysis procedures when conducting any education RCT:

• What to do during the analysis planning stage? Researchers should carefully describe, and commit to, a plan for dealing with missing data before looking at preliminary impact estimates or any outcome data files that include the treatment indicator variable. Committing to a design, and then sticking to it, is fundamental to scientific research in any substantive area. This is best accomplished by publishing the missing data plan prior to collecting the outcome data. If the original plan then fails to anticipate any of the missing data problems observed in the data, the plan should be updated and revised before any impact estimates are produced.

Researchers should also consider conducting sensitivity analysis to allow readers to assess how different the estimated impacts might be under different assumptions or decisions about how to handle missing data (see Chapter 3 for a discussion of one approach involving placing "best and worst case" bounds around the estimated impacts). These planned sensitivity tests should be specified ahead of time in the analysis plan.
• What information should the impact report provide about missing data? In their impact report, researchers should report missing data rates by variable, explain the reasons for missing data (to the extent known), and provide a detailed description of how missing data were handled in the analysis, consistent with the original plan.12 Impact reports should also provide key descriptive statistics for the study sample, including: (1) differences between the treatment and control group on baseline characteristics both at the point of random assignment and for the impact analysis sample (excluding, of course, any missing data imputations); and, (2) differences in baseline characteristics between treatment group respondents and non-respondents (i.e., those with and without outcome data), and similarly between control group respondents and non-respondents.
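The "best and worst case" bounds mentioned in the sensitivity-analysis suggestion above can be computed with a short sketch (hypothetical scores on a 0-100 scale): fill missing treatment-group outcomes with the top of the scale and missing control-group outcomes with the bottom for the best case, and reverse the fills for the worst case:

```python
import numpy as np

# Hypothetical scores on a 0-100 test; None marks missing post-tests.
treat_scores = [72, 81, None, 65, None, 90]
ctrl_scores = [70, None, 60, 75, 68, 66]
lo, hi = 0.0, 100.0   # logical bounds of the score scale

def filled_mean(scores, fill):
    """Mean of the scores with every missing value set to `fill`."""
    return np.mean([fill if s is None else s for s in scores])

# Best case for the treatment: missing treatment scores at the maximum and
# missing control scores at the minimum; the worst case reverses the fills.
best = filled_mean(treat_scores, hi) - filled_mean(ctrl_scores, lo)
worst = filled_mean(treat_scores, lo) - filled_mean(ctrl_scores, hi)
print("impact bounds:", round(worst, 2), "to", round(best, 2))
```

If the bounds are wide, as in this toy example, the exercise tells readers how much the conclusions hinge on assumptions about the missing outcomes.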

Final Caveats
Readers are cautioned to keep in mind that these simulation results are specific to a particular type of evaluation—an RCT in which schools are randomized to experimental conditions. Whether the key findings from these simulations would apply in RCTs that randomize students instead of schools is an open question that we have not addressed in this report. In addition, it is not clear whether the findings from our simulations would be sensitive to changes in the key parameter values that we set in specifying the data generating process and the missing data mechanisms. Finally, we could not test all possible methods to address missing data. These limitations notwithstanding, we believe these simulations yield important results that can help inform decisions that researchers need to make when they face missing data in conducting education RCTs.

Finally, despite all of the insights gained from the simulations, we cannot propose a fully specified and empirically justified decision rule about which methods to use and when. Too often, the best method depends on the purpose and design of the particular study and the underlying missing data mechanism that cannot, to the best of our knowledge, be uncovered from the data.


1 The Regional Educational Laboratories serve the states in their regions with research and technical assistance, including both original studies—of which the RCTs are the largest—and syntheses of existing research. For more information on the RELs, see http://ies.ed.gov/ncee/edlabs/.
2 See ongoing and completed evaluation studies sponsored by IES's National Center for Education Evaluation and Regional Assistance at http://ies.ed.gov/ncee/projects/evaluation/index.aspyear.
3 For simplicity we use an example of a simple two-group design with a single treatment and control group. Real world RCTs can include a variety of combinations including multiple treatment arms, and designs in which there is no control group, i.e., the study compares outcomes across different treatments.
4 In some RCTs, the counterfactual represented by the control group reflects the conditions that would prevail in the absence of the intervention being tested. This counterfactual condition is often, and perhaps misleadingly, referred to as "business as usual." In other RCTs, the study is designed to compare the impacts of two alternative interventions.
5 In most cases, study participants are randomly assigned on an equal basis to the treatment and control groups. However, there are situations in which there is good reason to use unequal allocation, for example, where there is strong resistance to placing participants into the control group. Such variation from a 50:50 allocation will result in some loss of statistical power, but this is generally modest in magnitude.
6 U.S. Department of Education, (2008).
7 The category is also described in the literature as non-ignorable non-response (NINR).
8 This method may have only been used in the single study for which it was developed (Bell & Orr, 1994). However, we decided to include it in our review because it reflects a fundamentally different approach to missing data focusing on the re-specification of the analysis model.
9 Although the detailed results provided in Chapter 4 and Appendix D include models that include or exclude the pretest covariate, our recommendations are based on the simulations in which pretest scores were available. When pretest scores were not available, some of the methods we recommend produced impact estimates with bias of greater than 0.05 standard deviations. Our simulation results suggest that producing impact estimates with bias below this threshold depends on both the choice of methods and the availability of data on important covariates.
10 Note that the "simple" weighting approach cannot be applied when data are missing for entire schools, because it involves weighting up respondents from a given school to represent nonrespondents from that school; with school-level missing data there are no respondents to use for this purpose.
11 For example, when missing post-tests depended on the values of the post-test scores, and data were missing for 40 percent of students, including the pretest score as a covariate in the models reduced the bias from 0.124 to 0.067 for case deletion, and it reduced the bias from 0.122 to 0.061 for multiple stochastic regression imputation (see Appendix D, Table III.b.1).
12 Additional guidelines for reporting missing data patterns and procedures appear in Burton & Altman (2004).