A. Why Conduct RCTs?
The Randomized Controlled Trial (RCT) has long been a mainstay of medical research to examine the effectiveness of different types of health care services (e.g., approaches to medical and nursing practice) as well as technologies such as pharmaceuticals and medical devices. In recent years, RCTs have become the "gold standard" for social policy evaluation in a wide range of areas including education (U.S. Department of Education, 2008).
RCTs are well designed to solve the classic problem of causal inference, formalized in the "Rubin Causal Model," that arises because we can observe outcomes for individuals in the group that receives the treatment but we cannot observe what would have happened if these same individuals had not received the selected intervention (e.g., Imbens & Wooldridge, 2009). For example, we cannot observe how the same class of students would have performed on a standardized test if they had been taught using a different curriculum or teaching method. All we can observe for the children is how they did when taught by their current teacher with whatever that entails in terms of the curriculum or pedagogical approach. To address this problem, random assignment produces a control group that differs systematically from the treatment group in only one way: receipt of the intervention being evaluated. Therefore, the control group yields information on how the treatment group would have fared under the counterfactual, or "untreated," condition.
As discussed in Chapter 1, the advantage of the RCT is that if random assignment is properly implemented (i.e., the process is truly random) with a sufficient sample size, program participants are not expected to differ in any systematic or unmeasured way from non-participants except through their access to the new instructional program.13 By eliminating the effect of any confounding factors, randomization allows us to make causal statements about the effect of a particular educational program or intervention, i.e., observed outcome differences are caused by exposure to the treatment. In fact, with a randomized design, if one has complete outcome data, a simple comparison of treatment-control group average outcomes yields an unbiased estimate of the impact of the particular program or intervention on the study participants.
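This property is easy to verify in a small simulation. The sketch below is illustrative only: the `simulate_rct` helper, sample size, and effect size are assumptions, not values from the text.

```python
import random
import statistics

random.seed(0)

def simulate_rct(n=10_000, true_impact=5.0):
    """Simulate an individually randomized trial (illustrative values)."""
    records = []
    for _ in range(n):
        ability = random.gauss(50, 10)   # latent baseline achievement
        trt = random.randint(0, 1)       # coin-flip random assignment
        outcome = ability + true_impact * trt + random.gauss(0, 5)
        records.append((trt, outcome))
    return records

data = simulate_rct()
treated = [y for t, y in data if t == 1]
control = [y for t, y in data if t == 0]

# With complete outcome data, the simple difference in group means is
# an unbiased estimate of the average impact on study participants.
impact_estimate = statistics.mean(treated) - statistics.mean(control)
print(round(impact_estimate, 2))  # close to the true impact of 5.0
```

Because assignment is random, no regression adjustment is needed for unbiasedness; covariates only improve precision.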
This certainty of attribution to the right causal factor can never be achieved if schools and staff make their own choices regarding, for example, the type of instruction used for mathematics. Too many things about the schools, teachers, and students could potentially differ, and this can undermine our ability to reliably attribute observed outcome differences to the single causal factor: the treatment condition. Although researchers have suggested a large number of non-experimental methods for achieving the same purpose, such as multivariate regression, selection correction methods (Heckman & Hotz, 1989), and propensity score methods (Rosenbaum & Rubin, 1983), a long line of literature, including recent analyses by Bloom, et al. (2002), Agodini & Dynarski (2001), and Wilde & Hollister (2002), suggests that none of these methods provides causal attribution matching the reliability of random assignment.
B. RCTs in Education
RCTs have been used in education to estimate the impacts of a wide range of interventions, including evaluations of broad federal programs such as Upward Bound (Seftor, et al., 2009) and Head Start (Puma, et al., 2005), school reform initiatives such as Success for All (Borman, et al., 2007) and Comer's School Development Program (Cook, et al., 1999), and subject-specific instructional programs such as Accelerated Reader (Ross, et al., 2004; Bullock, 2005) and Connected Mathematics (REL-MA, 2008). In the case of instructional programs, the direct treatment that is being manipulated often involves training teachers in a new curriculum or instructional practice, and the trained teachers are then expected to implement the new approach in their classrooms. Because the primary interest of such RCTs is the impact on student learning, the actual treatment includes both the training itself plus how teachers, in fact, implement the new instructional method, including any real world adaptations and distortions of the expected intervention. For example, among the 25 RCTs currently being conducted by the IES-funded Regional Educational Labs (RELs), 20 are testing different models of instructional practice that include a teacher professional development component.14
Outside the field of education, it is common to randomly assign individuals to the treatment and control groups (e.g., individual patients who do or do not get a new drug regimen), but individual-level random assignment is less common in education. More typically, researchers conduct Group Randomized Trials (GRTs) in which the units of random assignment are intact groups of students—either entire schools or individual teachers and their classrooms—but the primary interest of the study is typically the impact of the selected treatment on student-level outcomes (although it is not uncommon to look for intermediate impacts on schools or teachers). This leads to a hierarchical, or nested, research design in which the units of observation are members of the groups that are the actual units of random assignment. As Murray (1998) describes, the groups that are the units of random assignment are not generally formed at random; rather, some connection among the individuals within a group creates a correlation among their outcomes. For example, 3rd grade students in the class of a particular teacher are likely to share a variety of similar characteristics. (This correlation has important analytic implications for the simulations described in Chapter 4.)
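The within-group correlation Murray describes is commonly summarized by the intraclass correlation (ICC). The following sketch, with assumed (illustrative) variance components, simulates students nested in classrooms and recovers the ICC from a standard variance decomposition:

```python
import random
import statistics

random.seed(1)

N_CLASSES, CLASS_SIZE = 200, 25
BETWEEN_SD, WITHIN_SD = 4.0, 8.0   # assumed variance components

# Students in the same classroom share a common classroom effect,
# which induces correlation among classmates' scores.
classes = []
for _ in range(N_CLASSES):
    class_effect = random.gauss(0, BETWEEN_SD)
    classes.append([50 + class_effect + random.gauss(0, WITHIN_SD)
                    for _ in range(CLASS_SIZE)])

# ICC = between-classroom variance / total variance.
class_means = [statistics.mean(c) for c in classes]
within_var = statistics.mean(statistics.variance(c) for c in classes)
# The variance of class means overstates the between-class variance by
# within_var / class size, so subtract that term (ANOVA estimator).
between_var = statistics.variance(class_means) - within_var / CLASS_SIZE
icc = between_var / (between_var + within_var)
print(round(icc, 2))  # roughly 0.2 given the assumed components
```

A nonzero ICC means students within a cluster carry less independent information than the same number of unrelated students, which is why clustering matters for the designs and simulations discussed later.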
GRT designs are well illustrated by the collection of 25 experimental studies currently being conducted by the RELs: 17 randomly assigned entire schools, six assigned classes/teachers within schools, and two assigned individual students. The studies typically involve a single cohort of study participants, but five have multiple annual cohorts. Most (22) are conducting follow-up testing of students using some form of standardized test to measure student achievement, while eight are collecting scores on state assessments from administrative records (either as the sole outcome measure or in conjunction with study-administered testing). Some (5) are also measuring student non-achievement outcomes (e.g., course-taking, instructional engagement) from student surveys or school administrative records. Because many of the interventions involve teacher professional development, several (9) include measures of teacher practice and two have included teacher tests to gauge teacher knowledge. Follow-up data collection is typically conducted at a single point in time, approximately one year after randomization, but can also include multiple outcome testing points; essentially all of the studies collected baseline data to improve the precision of the impact estimates and identify student subgroups of particular interest (e.g., pre-intervention test scores, student demographic characteristics).
C. Defining the Analysis Sample
Because the unit of assignment is usually the school, classroom, or teacher, while the unit of analysis is the student, multi-level modeling is the typical approach to impact estimation in group RCTs to account for the associated clustering.15 These models include a dummy variable to distinguish the treatment group from the control group at the appropriate level, depending on the unit of assignment,16 and control variables at the student level, such as pre-intervention test scores, to increase the precision of the impact estimates. The estimated coefficient on the treatment indicator provides the study's estimate of the average effect of the intervention on all students randomized. Referred to as the "intent-to-treat" (ITT) effect, this estimate would, for example, capture the impact of a policy which made professional development available to teachers regardless of whether all of the individual teachers actually took part in the training. In other words, the ITT effect captures the impact of the offer of a particular intervention, not the impact of a school or a teacher actually participating in the intervention.
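A full multilevel model is beyond a short sketch, but the ITT logic under cluster assignment can be illustrated by aggregating to the unit of assignment: average the outcome within each school and then compare treatment and control school means. All names and parameter values below are illustrative assumptions, not the report's actual estimation procedure.

```python
import random
import statistics

random.seed(2)

N_SCHOOLS, STUDENTS, TRUE_IMPACT = 100, 30, 3.0   # assumed values

# Simulate a school-randomized trial: half the schools are assigned to
# treatment, and every student in a treatment school is "offered" the
# intervention, regardless of whether teachers actually take it up.
schools = []
for s in range(N_SCHOOLS):
    trt = 1 if s < N_SCHOOLS // 2 else 0
    school_effect = random.gauss(0, 4)    # shared cluster effect
    outcomes = [50 + school_effect + TRUE_IMPACT * trt + random.gauss(0, 8)
                for _ in range(STUDENTS)]
    schools.append((trt, outcomes))

# Estimate the ITT effect at the level of random assignment:
# compare the average school mean between the two groups.
t_means = [statistics.mean(y) for trt, y in schools if trt == 1]
c_means = [statistics.mean(y) for trt, y in schools if trt == 0]
itt_estimate = statistics.mean(t_means) - statistics.mean(c_means)
print(round(itt_estimate, 1))  # near the true impact of 3.0
```

The multilevel models described above generalize this idea, weighting information appropriately across levels and allowing student-level covariates such as pretest scores.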
There are at least two reasons to focus on estimating the ITT effect. First, it is the true experimental impact estimate because all treatment and control group members are included in the analysis. Second, the ITT effect often reflects the impact of the feasible policy option: making a particular program available to a specified set of intended participants. That is, a program can be made available, but whether it is implemented as intended, or whether all of the targeted participants actually get the intervention, is difficult if not impossible to control. For example, consider an RCT on teacher professional development that could inform a state policy decision on whether to offer a particular type of training to some or all schools in the state. In this case, state policy makers would benefit from evidence on the impacts of offering the professional development to schools—not on the effects of schools accepting the offer, teachers receiving the training, or other factors that are beyond the control of state policymakers.17
Given that RCTs in education need to estimate the ITT effect, what sample becomes the target for data collection? In estimating the ITT effect, we need to collect data on, and analyze, all students and schools that were randomly assigned. For example, it would be convenient to simply exclude students from the study if they move to a school that cannot provide data, but there may be systematic differences between the "stayers" in the two groups, especially if the treatment affects the probability of remaining in the school. Therefore, excluding the "movers" from the sample—or removing any other group from the sample on the basis of a factor that could have been affected by the treatment—undermines the internal validity that randomization was designed to ensure.18
Even treatment group members who do not get the intervention have to be part of the impact analysis sample. ITT analysis does not allow them to be omitted from the analytical sample, because their counterparts in the control group cannot be identified and similarly excluded to maintain the equivalence of the two groups.
Therefore, it is important either to obtain data on the full randomized sample or, when this is not feasible, to select appropriate methods for addressing missing data. Hence, regardless of the research goal, the missing data methods in this report should be applied to the full randomized sample, and in the chapters that follow the different methods are assessed in terms of their ability to successfully deal with the potential bias that may be introduced when data are missing.
D. How Data Can Become Missing
Statistical textbooks rarely deal with the real world situation in which study participants are entirely absent from the data set, or in which particular study participants lack data on one or more analytical variables. For example, as defined in the Stata manual, "Data form a rectangular table of numeric and string values in which each row is an observation on all of the variables and each column contains the observations on a single variable."19 The "default" solution to missing data—referred to as listwise or casewise deletion, or complete case analysis—is a common feature of most statistical packages: any observation lacking a value on an analytical variable is simply dropped. This approach is simple to implement, but it can reduce the size of the available sample and associated statistical power, and, as discussed later in this report, may introduce bias in the impact estimate.
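A minimal illustration of listwise deletion (the records and score values below are hypothetical):

```python
# Hypothetical student records: (pretest, posttest); None marks a
# missing value.
records = [
    (42.0, 51.0),
    (None, 47.0),   # missing pretest
    (55.0, None),   # missing posttest
    (38.0, 44.0),
    (61.0, None),   # missing posttest
]

# Listwise (casewise) deletion: keep only rows with no missing values.
# Simple to implement, but here it discards 3 of the 5 observations,
# shrinking the sample and its statistical power.
complete_cases = [r for r in records if None not in r]
print(len(records), len(complete_cases))  # 5 2
```

Whether the discarded rows also bias the impact estimate depends on the missing data mechanism, discussed in the next section.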
The most obvious way that data can become missing is a failure to obtain any information for a particular participant or observation, a situation called "unit non-response" in the survey literature. In education this can occur for several reasons. For example, individual student test scores may be missing because parents did not consent to have their child tested, students were absent or had transferred to another school, the child was exempted from taking the test (e.g., because of a disability or limited English language ability), or the student's classroom was unavailable for testing (e.g., a fire drill took place). In the case of test scores from administrative records, the school or district may have been unwilling or unable to provide data. Teacher data may be missing because of a refusal to complete a survey or take a test, extended absence from school, or transfer to another school.
In addition to the complete absence of data for a particular randomized study participant, an often more common problem is "item non-response" where respondents refuse to answer a particular question, are unable to provide the information ("don't know"), inadvertently skip a question or test item, or provide an unintelligible answer. Sometimes data may be "missing" by design in survey data because a particular question is not applicable.20 Or certain questions may be skipped to reduce burden on individual respondents, who receive only a subset of the full set of possible questions (called "matrix sampling").
Longitudinal studies in which data are collected from study participants at multiple time points, or "waves," present different missing data possibilities. For example, consider a study in which data were collected at four separate time points and data are available as shown in the example below:
Student    Wave 1    Wave 2    Wave 3    Wave 4
   A          X         X         X         X
   B          X                   X         X
   C          X         X                   X
   D                    X         X         X
   E          X         X         X
   F          X         X
In this example, Student A was tested at all waves. Students B-E provided incomplete data, i.e., data were not obtained at some data points (wave non-response). Student F only provided data at the first and second time points; this child was a "drop out" from the study (often called study attrition) because, for example, he left the study school. As will be discussed in the next chapter, these patterns of wave non-response can provide some opportunities for imputing or otherwise adjusting for the missing time points in the sequence, to the extent that outcome measures for a given student are likely to be correlated over time.21
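As a simple illustration of exploiting over-time correlation, the sketch below fills an interior missing wave by linear interpolation between a student's observed waves. This is only a stand-in for the model-based approaches discussed in the next chapter; the scores are hypothetical.

```python
def interpolate_waves(scores):
    """Fill interior gaps (None) by linear interpolation between the
    nearest observed waves. Leading/trailing gaps are left untouched."""
    filled = list(scores)
    observed = [i for i, s in enumerate(filled) if s is not None]
    for i, s in enumerate(filled):
        if s is None and observed and observed[0] < i < observed[-1]:
            left = max(j for j in observed if j < i)
            right = min(j for j in observed if j > i)
            frac = (i - left) / (right - left)
            filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled

# A student observed at waves 1 and 4 only (hypothetical scores):
filled = interpolate_waves([48.0, None, None, 57.0])
print([round(v, 1) for v in filled])  # [48.0, 51.0, 54.0, 57.0]
```

Note that a study dropout like Student F has trailing gaps, which interpolation cannot fill; that pattern requires the extrapolation or modeling methods covered later.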
E. The Missing Data Problem
As discussed in Chapter 1, there are two potential problems that can result from missing data.22 First, if the missing data mechanism is different for the treatment and control group, dropping cases with missing data can introduce systematic differences which can, in turn, lead to biased impact estimates. Second, even if the missing data mechanism is the same for the treatment and control groups, we may still be concerned about missing data if certain types of teachers or students are more likely to have missing data and thus are under-represented in the analysis sample. If the impact of the educational intervention varies, this can lead to biased impact estimates: for example, if the impacts for underrepresented groups are higher than the average impact, then the average impact estimates will be biased downward; alternatively, if the impacts for underrepresented groups are lower, the average impact estimates will be biased upward.
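The second problem can be made concrete in a small simulation: missingness below is identical in both arms, but the subgroup with the larger impact is under-represented among complete cases, so the complete-case estimate falls short of the true average impact. All parameter values are illustrative assumptions.

```python
import random
import statistics

random.seed(4)

rows = []
for _ in range(40_000):
    high_impact = random.random() < 0.5           # two equal subgroups
    trt = random.randint(0, 1)
    impact = 8.0 if high_impact else 2.0          # true average impact: 5.0
    y = random.gauss(50, 10) + impact * trt
    # Missingness is the SAME in both arms, so no treatment-control
    # imbalance arises, but the high-impact subgroup is lost more often.
    p_missing = 0.5 if high_impact else 0.1
    if random.random() > p_missing:
        rows.append((trt, y))

treated = [y for trt, y in rows if trt == 1]
control = [y for trt, y in rows if trt == 0]
estimate = statistics.mean(treated) - statistics.mean(control)
print(round(estimate, 1))  # below the true average impact of 5.0
```

Reversing which subgroup is under-represented would bias the estimate upward instead, matching the two directions described above.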
We believe that both of these problems are serious. In the first case, the seriousness of the problem is probably noncontroversial: if missing data produces systematic differences between the complete cases in the treatment group and the complete cases in the control group, then the impact estimates will be biased. However, the second problem may warrant additional consideration. In many RCTs in education, the schools in the sample constitute a sample of convenience: they are not selected randomly and thus cannot formally be considered representative of any larger population (e.g., Bernstein, et al., 2009; Constantine, et al., 2009; and Garet, et al., 2008). Therefore, some analysts may argue that if the study is not designed to produce externally valid estimates, we should be less concerned about missing data problems that make the analysis sample "less representative." However, some RCTs in education do in fact select schools or sites randomly. Furthermore, in those RCTs that select a nonrandom sample of convenience (schools willing to participate in the study), the study's goal is presumably to obtain internally valid estimates of the intervention's impact for that sample of schools. If missing data problems lead to a sample of students with complete data in those schools that is not representative of all students in those schools, we believe this is a problem that should be addressed.
The bias that can be introduced by missing data can most easily be understood by considering a typical education RCT. For example, consider a study for which one has a primary student-level outcome variable, Y, such as a student assessment in reading, a treatment indicator "Trt" (Trt=1 if assigned to the treatment group, and =0 if assigned to the control group, which we assume is always known), and a set of covariates, X1 through Xn, all measured at the time of, or prior to, random assignment (e.g., a student's prior assessment score and other demographic variables). Although we could have missing data for the outcome variable Y or for any of the control variables X1 through Xn, for simplicity let us consider cases where only the outcome variable is missing for some observations.
As discussed in Chapter 1, the most innocuous form of missing data in the Rubin framework is called Missing Completely at Random (MCAR). In our example, this situation would hold if the probability of the outcome test score being missing is unrelated to the student's "true" test score (i.e., students who would score higher or lower at the point of outcome testing are not more or less likely to be missing) or to any of the other important measured student characteristics (e.g., gender or race). This condition would be violated if, for example, students with low pretest scores were more likely to be missing the post-test score because they refused or were unable to complete the test, or because their parents were more likely to fail to provide consent for the outcome testing. MCAR is a strong assumption and one that, in our view, may not be reasonable in most situations.23
The second category in Rubin's framework, Missing at Random (MAR), would hold if the probability of the outcome being missing is unrelated to a student's true test score after controlling for the other variables in the analysis.24 In other words, the missingness is random conditional on the observed X's. For example, the missing data would be MAR if missingness on Y was related to gender but, conditional on gender—that is, among boys or among girls—the probability of missing data is a constant. This condition would, however, be violated if, for example, students missing the post-test score were those who would have scored lower (had they been tested) than the students who were actually tested. It is impossible, of course, to determine if this condition exists because the data on the untested students are missing, so one cannot compare the scores for the tested and untested students.
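The violation just described can be illustrated with a small simulation: below, students who would score lower are more likely to be missing the post-test, so the mean among tested students overstates the true mean. Parameter values are illustrative assumptions.

```python
import random
import statistics

random.seed(3)

true_scores, tested_scores = [], []
for _ in range(20_000):
    score = random.gauss(50, 10)          # the score a student would get
    true_scores.append(score)
    # Missingness depends on the (unobserved) score itself: low scorers
    # are far more likely to be missing -- neither MCAR nor MAR holds.
    p_missing = 0.5 if score < 45 else 0.1
    if random.random() > p_missing:
        tested_scores.append(score)

bias = statistics.mean(tested_scores) - statistics.mean(true_scores)
print(round(bias, 1))  # positive: tested students overstate the mean
```

In a real study only `tested_scores` would ever be observed, which is precisely why the violation cannot be detected from the data alone.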
If the assumptions of MCAR or MAR are true, the missing data mechanism can be considered "ignorable" (MCAR) or correctable (MAR); i.e., in effect the factors that cause missingness are unrelated (or weakly related) to the parameters to be estimated in the analysis. Under such conditions, a variety of techniques discussed in Chapter 3 are available to deal with missing data. In some situations, however, one cannot reasonably assume such ignorability, a category called Not Missing at Random (NMAR). In these situations, the methods that are available (discussed at the end of Chapter 3) require good information about the determinants of missingness to model the causal mechanism, and not surprisingly, the results of an impact analysis in a NMAR situation are quite sensitive to one's choice of assumptions and statistical approach.