Technical Methods Report: What to Do When Data Are Missing in Group Randomized Controlled Trials

NCEE 2009-0049
October 2009

Appendix C: Specifications for Missing Data Simulations

Introduction to Notation
The following notation is used throughout this appendix:

YPre,ij is a student achievement test score, measured at baseline (pre-treatment) for the ith student, nested in the jth school;
i = 1…60 (students per school); j = 1…60 (schools);
YPost,ij is a student achievement test score, measured at follow-up (post-treatment) for the ith student, nested in the jth school;
Femaleij = 1 if student is female, = 0 if male;
Female_Cenij is the grand-mean centered covariate for Female, obtained as Female_Cenij = Femaleij − Mean(Femaleij), where the mean is taken over all students in the sample;
HiRiskij = 1 if student is high risk (e.g., low income), = 0 otherwise;
HiRisk_Cenij is the grand-mean centered covariate for HiRisk, obtained as HiRisk_Cenij = HiRiskij − Mean(HiRiskij), where the mean is taken over all students in the sample;
Trtj = 1 if school j was randomly assigned to the treatment condition, =0 if school j was randomly assigned to the control condition.

As part of the simulations, values of pretest and post-test scores were set to missing. The following variables represent the observed pretest and post-test scores, where some of the scores are observed (non-missing) and others have missing values:

YmissPre,ij is a pretest achievement score of the ith student, nested in the jth school; some values are missing, others are non-missing.
YmissPost,ij is a post-test achievement score of the ith student, nested in the jth school; some values are missing, others are non-missing.

Some additional notation is introduced in subsequent sections.

Hypothetical Education RCT Used in the Simulations
The assumed study design for the simulations is a randomized controlled trial (RCT) with random assignment of schools to treatment and control conditions. The goal of the fictional study that forms the basis of the simulations is to estimate the average impact of the treatment on student achievement. Key features of our fictional RCT design include: (1) 60 schools, with 30 assigned to treatment and 30 assigned to control; (2) 60 students per school; (3) baseline data on gender, an unspecified risk factor (e.g., low income), and pretest or pre-intervention achievement data in a single subject area (either reading or mathematics); and (4) follow-up outcome data on achievement in the same subject area as the pretest.

Estimation of the average impact of the hypothetical intervention on student achievement is assumed to be done using a 2-level hierarchical linear model, where students (level-1) are nested in schools (level-2), and the model includes gender and high risk status as student-level covariates. However, two different models are assumed to be estimated for the simulations: (1) Model A does not include a student-level pretest score as a covariate, and (2) Model B does include a student-level pretest score as a covariate:

Model A. Pretest score not available
YPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + εij

Model B. Pretest score is available
YPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(YPre,ij) + εij

In each model, α0j is a random school-level intercept that is assumed to be normally distributed with mean zero and variance τ2, i.e., α0j ~ N(0,τ2) . It is also assumed to be independent of εij, the student-level error term, and εij is assumed to be normally distributed with mean 0 and variance σ2, i.e., εij ~ N(0,σ2) . The coefficient βˆ1 provides an estimate of the Intent-to-Treat Effect, or, in the absence of noncompliance, the average impact of the treatment.

Later in this appendix, when we describe how the different missing data methods are implemented, we will refer back to these two generic analysis models to indicate how we estimated the treatment effect when data were missing.

Generation of the Simulated Data
This section describes the generation of data for a single simulated data set. The process described here was replicated 1,000 times, producing 1,000 simulated data sets. Letting missing data occur at random (within defined probabilities) many times, and then averaging the results of estimation models across the 1,000 data sets, ensures the robustness of the simulation findings and of any conclusions about the performance of various missing data methodologies drawn from them. Multiple replications also provide distributions of impact estimates and their standard errors, reflective of the sampling variability built into the data (and present in real data). Estimates from these multiple replications converge on population parameters; for example, if there were no missing data and we increased the number of generated data sets towards infinity, the mean of the parameter estimates from the many simulations would converge to the true population mean. For scenarios where there are missing data, we use all the replications to determine the closeness of the impact estimator's mean across the replications to the true population parameter. This serves as the measure of bias in the impact estimate.

Generating Demographic Characteristics
To generate the sex and academic risk indicators, we first generated 60 school IDs, and within each school generated 60 student IDs. Within each school, we set the value of Female to “0” for 30 of the students, and set the value of Female to “1” for the remaining 30 students. Within each school, 12 students (6 females and 6 males) had the value of HiRisk set to “1,” the remaining 48 students had the value of HiRisk set to “0.” To summarize:

  • each data set included 60 schools;
  • each school consisted of 60 students—30 students (50%) were Female, and 12 students (20%) were HiRisk; and
  • Female was independent of HiRisk.
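The roster construction just described can be sketched in Python (the report generated its data in SAS; function and field names here are illustrative):

```python
def make_school(school_id, n_students=60, n_hirisk=12):
    """Build one school's roster: half female, with HiRisk assigned
    independently of sex (6 high-risk females and 6 high-risk males)."""
    students = []
    for i in range(n_students):
        female = 1 if i < n_students // 2 else 0
        students.append({"school": school_id, "Female": female})
    per_sex = n_hirisk // 2
    flagged = {0: 0, 1: 0}
    for s in students:
        if flagged[s["Female"]] < per_sex:
            s["HiRisk"] = 1
            flagged[s["Female"]] += 1
        else:
            s["HiRisk"] = 0
    return students

# 60 schools of 60 students each = 3,600 student records per data set.
roster = [s for j in range(1, 61) for s in make_school(j)]
```

Because HiRisk is split evenly across the sexes within every school, the two indicators are exactly independent by construction.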

Generating Pretest Scores
To generate pretest scores, we began by generating random error terms for schools and students. To generate random school effects, we used the Normal function in SAS to generate 60 random normal deviates from a normal distribution with mean equal to 0 and variance equal to 0.10. As will be shown subsequently, these represent the deviations of each of the 60 schools' individual intercepts from the grand mean intercept. In model notation, these are the values of α0j, generated as α0j ~ N(0, τ2), where τ2 is set to equal 0.10. In each simulated data set, each of the 60 schools was assigned one of these values. All students within a particular school shared the same value of the school-level random deviate.

In the next step, we again used SAS's Normal function to generate values from a normal distribution. This time we generated 3,600 values from a distribution with mean equal to 0 and variance equal to 0.90, corresponding to the 60 students within each of the 60 schools. Each of the simulated students was assigned a value from this random normal distribution. These values correspond to the random deviation terms, εij, which represent the difference between an individual student's pretest score and his or her school's average value, conditional on the student's covariate values. To summarize, we generated:

  • school-level random effects, i.e., 60 values of α0j, from a normal distribution with mean 0 and variance 0.10, and
  • student-level random error terms, i.e., 3,600 values of εij, from a normal distribution with mean zero and variance 0.90.

Next, we generated the values of each student's pretest (i.e., baseline) achievement score. The value of each student's pretest score was generated as a function of:

  • a grand-mean intercept;
  • student's gender;
  • student's status on the HiRisk variable;
  • the school-level mean pretest score, specifically, the school's deviation from the grand-mean intercept, α0j; and,
  • student-level residual error, εij

Using the values of the variables as described above, each student's pretest achievement score, YPre,ij, was generated from the following equation:

YPre,ij = β0 + β1(Female_cenij) + β2(HiRisk_cenij) + α0j + εij

where:
          β0 = 0
          β1 = 0.20
          β2 = -0.80
          α0j ~N(0,0.1)
          εij ~N(0,0.9)

(See Chapter 4 for citations to justify our choices of β1 and β2.) Note that the mean of YPre,ij is,

      Mean(YPre,ij) = Mean(β0) + Mean(β1(Female_cenij)) + Mean(β2(HiRisk_cenij)) + Mean(α0j) + Mean(εij)
      = 0 + (β1)(0) + (β2)(0) + 0 + 0
      = 0

And note that the level-1 (student-level) variance of YPre,ij is,

      Var(YPre,ij) = Var(β0) + Var(β1(Female_cenij)) + Var(β2(HiRisk_cenij)) + 0 + Var(εij)
      = 0 + (β1)2Var(Female_cenij) + (β2)2Var(HiRisk_cenij) + 0 + Var(εij)
      = (0.20)2(0.25) + (−0.80)2(0.16) + 0.90
      ≈ 1.01

The level-2 (school-level) variance of YPre,ij is,

      Var(α0j) = 0.10

Thus, the intraclass correlation (ICC) of the pretest scores is,

      ICC = τ2 / (τ2 + σ2) = 0.10 / (0.10 + 1.01) ≈ 0.09
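The pretest-generation steps above can be sketched in Python (the report used SAS's Normal function; names here are illustrative, and the centering constants 0.5 and 0.2 are the design's grand means for Female and HiRisk):

```python
import random

TAU2, SIGMA2 = 0.10, 0.90        # school- and student-level variances
B0, B1, B2 = 0.0, 0.20, -0.80    # coefficients of the pretest model

rng = random.Random(12345)       # fixed seed for reproducibility
# One random intercept deviation alpha_0j per school.
alpha = {j: rng.gauss(0.0, TAU2 ** 0.5) for j in range(1, 61)}

pretest = []
for j in range(1, 61):
    for i in range(60):
        female_cen = (1 if i < 30 else 0) - 0.5       # grand mean of Female = 0.5
        hirisk_cen = (1 if i % 5 == 0 else 0) - 0.2   # grand mean of HiRisk = 0.2
        eps = rng.gauss(0.0, SIGMA2 ** 0.5)           # student-level error
        pretest.append(B0 + B1 * female_cen + B2 * hirisk_cen + alpha[j] + eps)
```

With these parameters the generated scores should have a mean near 0 and a total variance near 1.11 (1.01 at level 1 plus 0.10 at level 2), consistent with the derivations above.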

Generating Post-test Scores
To generate post-test scores, we began by generating random deviates for schools, α0j* ~ N(0,τ2), and students, εij* ~ N(0,σ2), where the stars are used to indicate that these are different sets than the random deviates used to create the pretest scores. The value of each student's post-test score was generated as a function of:

  • a grand-mean intercept;
  • student's gender;
  • student's status on the HiRisk variable;
  • student's pretest achievement score, YPre,ij;
  • treatment status, Trtj;
  • a negative interaction effect of treatment by pretest (Trtj* YPre,ij)—the treatment effect is larger for students with lower pretest scores than for students with higher pretest scores;
  • the school-level mean post-test score, or put differently, the school's deviation from the grand-mean, α0j*; and,
  • student-level residual error, εij*.

Post-test scores are generated from the following model:

YPost,ij = β0* + β1*(Female_cenij) + β2*(HiRisk_cenij) + β3*(YPre,ij) + β4*(Trtj) + β5*(Trtj × YPre,ij) + C(α0j* + εij*)

where:
          β0*= 0
          β1*= 0.02
          β2*=–0.05
          β3*= √0.50
          β4*= 0.20
          β5*=–0.20/3
          C = √0.50
          α0j* ~ N(0,0.1)
          εij* ~ N(0,0.9)
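A minimal Python sketch of the post-test equation, using the coefficients listed above (the function name and argument order are illustrative):

```python
# Post-test model coefficients from the text; B5 is the negative
# treatment-by-pretest interaction, and C scales both error terms.
B0, B1, B2 = 0.0, 0.02, -0.05
B3 = 0.50 ** 0.5
B4, B5 = 0.20, -0.20 / 3
C = 0.50 ** 0.5

def gen_posttest(y_pre, female_cen, hirisk_cen, trt, alpha_star, eps_star):
    """Compute one student's post-test score from the generating model."""
    return (B0 + B1 * female_cen + B2 * hirisk_cen + B3 * y_pre
            + B4 * trt + B5 * (trt * y_pre) + C * (alpha_star + eps_star))
```

Because B5 is negative, the implied treatment effect (B4 + B5 × YPre) shrinks as the pretest score rises, matching the bulleted description of the interaction.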

The process described above, to generate a single data set, was replicated 1,000 times to generate 1,000 data sets. Because we generated random values of α0j, εij, α0j*, and εij* from the distributions described above, each data set was different from all the others.

Missing Data Mechanism
The process described in the previous section was used to generate 1,000 complete data sets—that is, data sets without any missing values. In this section, we describe the process by which we generated missing values. Effectively, this involved selecting a random subsample from each of the 1,000 randomly generated samples, and for each subsample, setting the value of either the pretest or the post-test to missing.

The first step in this process involved specifying how the subsample would be selected. In particular, we specified the probability that the pretest or post-test would be missing from the data. Then these probabilities were used to select the subsample of cases for which the pretest or post-test would be set to missing.

In one set of simulations, we assumed that data were missing for a sample of students in each school. For these simulations, we randomly selected individual students within schools and set the value of their pretest score or post-test score to missing. In another set of simulations, we assumed that data were missing for entire schools (i.e., all students with a school had missing values). For these simulations, we randomly selected subsamples of schools and set the value of the pretest score or post-test score to missing for all students in these schools.

Missing Data Mechanism for Students
First, we generated missing values for those simulations in which data were missing for a sample of students in each school. We generated the missing indicator for three base scenarios, or missing data mechanisms, and for each base scenario, generated missing values such that either 5 percent of cases were missing or 40 percent of cases were missing. For each combination, we also generated data in which the pretest was set to missing and data in which the post-test was set to missing. None of our simulations set both variables to missing simultaneously, and no other variables (e.g., Female or HiRisk) were set to missing in any of our simulations.

Scenario I - Missingness depends on treatment assignment.

In this scenario, the missing data mechanism is dependent only on treatment assignment. In particular, the missing data rates are higher for students in control schools than for students in treatment schools. But within each group, missing cases are a simple random sample of all cases in the group.

     Sub-scenario I, Missing data rate = low (5% overall)

For some simulations, we set either the pretest score or post-test score to missing for 5 percent of students in the sample. To do this, we first created a variable that indicated the probability of missing data, which we called MissProb. For treatment students (Trt = 1), MissProb was set to 0.04, and for control students (Trt = 0), MissProb was set to 0.06.

We then used SAS's RanBin function to generate values of 0 or 1 from a binomial distribution. The probability of generating a value of 1 was set to MissProb, and the new 0-1 variable was called MissIndicator (e.g., MissIndicator = RanBin(0,1, MissProb)).

Finally, we created values for the observed pretest scores and post-test scores, given that some values had been set to missing and would not be observed in the analysis. The observed pretest variable was set equal to the actual pretest score when the pretest score was non-missing; the observed pretest variable was set to a numeric missing value when the actual pretest score was missing from the data (e.g., not observed), as shown below:

  • YmissPre,ij =YPre,ij
  • YmissPost,ij =YPost,ij
  • If MissIndicator = 1 then YmissPre,ij = . (set to missing, in simulations with missing pretest scores)
  • If MissIndicator = 1 then YmissPost,ij = . (set to missing, in simulations with missing post-test scores)
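A Python sketch of this selection step, assuming the Scenario I probabilities (the report used SAS's RanBin; comparing a uniform draw against MissProb is an equivalent Bernoulli draw, and all names here are illustrative):

```python
import random

def set_missing(records, field, p_trt=0.04, p_ctl=0.06, seed=0):
    """Scenario I: missingness depends only on treatment status.
    Each student gets an independent Bernoulli(MissProb) draw."""
    rng = random.Random(seed)
    for r in records:
        miss_prob = p_trt if r["Trt"] == 1 else p_ctl
        # None plays the role of SAS's numeric missing value ".".
        r[field + "_obs"] = None if rng.random() < miss_prob else r[field]
    return records
```

Within each group the selected cases are a simple random sample, so the realized missing rates fluctuate around 4 and 6 percent across replications.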

     Sub-scenario II, Missing data rate = high (40% overall)

The missing data mechanism for this scenario was the same as that described in the previous section, except that the missing data rates were higher for both treatment schools and control schools. In particular:

  • if Trt = 1 then MissProb = 0.35
  • if Trt = 0 then MissProb = 0.45

All other steps were the same as described above.

Scenario II, Missingness depends on treatment assignment, pretest scores, and the interaction of the two.

The process we used for setting pretest and post-test scores to missing was the same as the process described for Scenario I, except that probability of missing values (MissProb) was dependent upon assignment to treatment, the pretest score, and the interaction of treatment assignment and pretest score. In particular:

  • The missing data rate is higher in the control group than the treatment group;
  • The missing data rate is higher for students with low pretest scores in both groups; but
  • The relationship between pretest and the missing data rate is much stronger in the control group than in the treatment group (to generate a difference in the missing data mechanism between the two groups).

     Sub-scenario I, missing data rate = low (5% overall)

For this subscenario, we set the missing data probability (MissProb) for each student as follows:

Quartile on Pretest Score     If Trt=1, Set MissProb to:   If Trt=0, Set MissProb to:
4 (>= 0.717)                  .03                          .03
3 (>= 0.011, < 0.717)         .04                          .05
2 (>= -0.703, < 0.011)        .04                          .07
1 (< -0.703)                  .05                          .09
Overall Average               .04                          .06

Note that in both treatment and control groups, the probability of missing data is higher for students with lower pretest scores, but that difference between the probabilities at the lowest and highest pretest quartiles is much greater in the control group than in the treatment group.
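The quartile lookup in the table above can be sketched as follows (cutoffs and probabilities are taken from the table; function names are illustrative):

```python
# Pretest quartile boundaries from the table (5% overall scenario).
CUTS = (-0.703, 0.011, 0.717)

MISS_PROB = {
    1: {4: 0.03, 3: 0.04, 2: 0.04, 1: 0.05},  # treatment group
    0: {4: 0.03, 3: 0.05, 2: 0.07, 1: 0.09},  # control group
}

def quartile(y_pre):
    """Map a pretest score to its quartile using the table's cutoffs."""
    if y_pre >= CUTS[2]:
        return 4
    if y_pre >= CUTS[1]:
        return 3
    if y_pre >= CUTS[0]:
        return 2
    return 1

def miss_prob(trt, y_pre):
    return MISS_PROB[trt][quartile(y_pre)]
```

The spread of probabilities across quartiles (.03 to .05 under treatment versus .03 to .09 under control) is what makes the pretest–missingness relationship stronger in the control group.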

     Sub-scenario II, missing data rate = high (40% overall)

For this subscenario, we set the missing data probabilities (MissProb) for each student as follows:

Quartile on Pretest Score     If Trt=1, Set MissProb to:   If Trt=0, Set MissProb to:
4 (>= 0.717)                  .30                          .30
3 (>= 0.011, < 0.717)         .35                          .40
2 (>= -0.703, < 0.011)        .35                          .50
1 (< -0.703)                  .40                          .60
Overall Average               .35                          .45

Scenario III, Missingness depends on treatment assignment, post-test scores, and the interaction of the two.

The process was the same as described for Scenario II, except that the probability of missing values (MissProb) depended on the post-test score rather than the pretest score. In particular:

  • The missing data rate is higher in the control group than the treatment group;
  • The missing data rate is higher for students with low post-test scores in both groups; but
  • The relationship between post-test and the missing data rate is much stronger in the control group than in the treatment group (to generate a different missing data mechanism for the two groups).

     Sub-scenario I, missing data rate = low (5% overall)

For this subscenario, we set the missing data probability (MissProb) for each student as follows:

Trt=1                                       Trt=0
Quartile on Post-Test Score   MissProb      Quartile on Post-Test Score   MissProb
4 (>= 0.865)                  .03           4 (>= 0.695)                  .03
3 (>= 0.205, < 0.865)         .04           3 (>= 0.004, < 0.695)         .05
2 (>= -0.457, < 0.205)        .04           2 (>= -0.691, < 0.004)        .07
1 (< -0.457)                  .05           1 (< -0.691)                  .09
Overall Average               .04           Overall Average               .06

Note that for post-test scores, unlike pretest scores, the quartile cutoffs differ between the treatment and control groups due to the effect of the treatment on post-test scores.
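A sketch of computing group-specific quartile cutoffs from the generated scores (this uses crude order statistics rather than interpolated percentiles, so it is an approximation of however the report's cutoffs were computed):

```python
def group_cutoffs(scores):
    """Return the 25th/50th/75th percentile cutoffs for one group.
    Computed separately for treatment and control, because the
    treatment shifts the post-test distribution upward."""
    s = sorted(scores)
    n = len(s)
    return (s[n // 4], s[n // 2], s[3 * n // 4])
```

Applying this separately to the treatment and control post-test scores yields the two distinct sets of cutoffs shown in the table, whereas pretest cutoffs are shared because random assignment leaves the baseline distributions identical in expectation.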

     Sub-scenario II, missing data rate = high (40% overall)

For this subscenario, we set the missing data probability (MissProb) for each student as follows:

Trt=1                                       Trt=0
Quartile on Post-Test Score   MissProb      Quartile on Post-Test Score   MissProb
4 (>= 0.865)                  .30           4 (>= 0.695)                  .30
3 (>= 0.205, < 0.865)         .35           3 (>= 0.004, < 0.695)         .40
2 (>= -0.457, < 0.205)        .35           2 (>= -0.691, < 0.004)        .50
1 (< -0.457)                  .40           1 (< -0.691)                  .60
Overall Average               .35           Overall Average               .45

     Other subscenarios

Under Scenario III, when the missing data mechanism depends on the post-test, we tested selected missing data methods at three different missing data rates between 5 percent and 40 percent: 10 percent, 20 percent, and 30 percent. The values of MissProb for these three missing data rates are provided below:

Missing Data Rate = 10 Percent
Trt=1                                       Trt=0
Quartile on Post-Test Score   MissProb      Quartile on Post-Test Score   MissProb
4 (>= 0.865)                  .06           4 (>= 0.695)                  .06
3 (>= 0.205, < 0.865)         .08           3 (>= 0.004, < 0.695)         .10
2 (>= -0.457, < 0.205)        .08           2 (>= -0.691, < 0.004)        .14
1 (< -0.457)                  .10           1 (< -0.691)                  .18
Overall Average               .08           Overall Average               .12

Missing Data Rate = 20 Percent
Trt=1                                       Trt=0
Quartile on Post-Test Score   MissProb      Quartile on Post-Test Score   MissProb
4 (>= 0.865)                  .14           4 (>= 0.695)                  .14
3 (>= 0.205, < 0.865)         .17           3 (>= 0.004, < 0.695)         .20
2 (>= -0.457, < 0.205)        .17           2 (>= -0.691, < 0.004)        .26
1 (< -0.457)                  .20           1 (< -0.691)                  .32
Overall Average               .17           Overall Average               .23

Missing Data Rate = 30 Percent
Trt=1                                       Trt=0
Quartile on Post-Test Score   MissProb      Quartile on Post-Test Score   MissProb
4 (>= 0.865)                  .22           4 (>= 0.695)                  .22
3 (>= 0.205, < 0.865)         .26           3 (>= 0.004, < 0.695)         .30
2 (>= -0.457, < 0.205)        .26           2 (>= -0.691, < 0.004)        .38
1 (< -0.457)                  .30           1 (< -0.691)                  .46
Overall Average               .26           Overall Average               .34

Missing Data Mechanism for Schools
In some RCTs, the missing data problem results from a lack of cooperation from schools and districts. Therefore, to account for this possibility, we ran a set of simulations under the assumption that data were missing for either 5 percent or 40 percent of schools— instead of for 5 percent or 40 percent of students within each school. The process used to generate missing values for all students in selected schools was largely parallel to the process used to generate missing values for selected students in each school.

However, for schools, the process for setting the missing data indicator to "1" operated at the school level. When a school had a value of "1" on the missing data indicator, all pretest or all post-test scores within that school were set to missing. For example, for Scenario I, the probability of missing data for an entire school was set to 4 percent for treatment schools and 6 percent for control schools. Within the schools selected as missing data cases, all pretest scores or all post-test scores were set to missing. For Scenarios II and III, quartiles were created from school-level means of pretest scores or post-test scores. However, the missing data probabilities were set to exactly the same values as shown in the previous section for missing students within schools.
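A Python sketch of the school-level mechanism for Scenario I, assuming one Bernoulli draw per school (names are illustrative):

```python
import random

def set_school_missing(records, field, p_trt=0.04, p_ctl=0.06, seed=0):
    """School-level Scenario I: one Bernoulli draw per school; a drawn
    school loses the value of `field` for every one of its students."""
    rng = random.Random(seed)
    school_trt = {}
    for r in records:
        school_trt[r["school"]] = r["Trt"]
    dropped = {j for j, trt in school_trt.items()
               if rng.random() < (p_trt if trt == 1 else p_ctl)}
    for r in records:
        r[field + "_obs"] = None if r["school"] in dropped else r[field]
    return records
```

The key difference from the student-level mechanism is the all-or-nothing pattern: within any school, either every value is observed or every value is missing.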

Missing Data Methods
The following missing data methods were tested in the simulations under each of the missing data scenarios described in the previous section:

  • Case deletion,
  • Dummy variable adjustment,
  • Mean value imputation,
  • Non-stochastic regression imputation,
  • Single stochastic regression imputation,
  • Multiple stochastic regression imputation,
  • Maximum Likelihood-EM algorithm with multiple imputation,
  • Simple weighting,
  • Sophisticated weighting, and,
  • Fully-specified regression models with treatment/covariate interactions.

Case Deletion
Case deletion means simply that, if there is a missing value for any variable used in the model, the entire observation (student or school) is omitted from the analysis. This method is also known as complete case analysis because only observations that have complete data (no missing values) for every variable in the model are used in the analysis.

Therefore, regardless of whether we are missing pretest scores or post-test scores, and regardless of whether data are missing for students within schools or for entire schools, we implemented case deletion by dropping the cases with missing values.

To estimate the treatment effect once cases had been deleted, we estimated either Model A or Model B, as described in the Generic Analysis Plan presented earlier in this appendix. When pretest score was missing for a fraction of the sample, we estimated Model B. When post-test score was missing for a fraction of the sample, the model we estimated depended on whether pretest scores were available or unavailable:

  • When pretest scores were available, we estimated Model B.
  • When pretest scores were unavailable, we estimated Model A.

Dummy Variable Adjustment (Missing Pretest Scores Only)
The dummy variable adjustment required the creation of two new variables, Y.dvPre,ij and DummyPre,ij, defined as follows:

Y.dvPre,ij = YmissPre,ij if YmissPre,ij is non-missing
  = 0 if YmissPre,ij is missing
DummyPre,ij = 1 if YmissPre,ij is missing
  = 0 if YmissPre,ij is non-missing

The analytical model used to estimate the treatment effect is similar to Model B, but the true value of the pretest is replaced by Y.dvPre,ij, and the dummy variable DummyPre,ij is added to the model, as shown below:

YPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(Y.dvPre,ij) + β5(DummyPre,ij) + εij
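The construction of the two derived variables can be sketched as follows (Python equivalents of the variable definitions above; names are illustrative):

```python
def dummy_adjust(y_miss_pre):
    """Dummy-variable adjustment: zero-fill a missing pretest and
    return an indicator marking which rows were filled."""
    y_dv = 0.0 if y_miss_pre is None else y_miss_pre
    dummy = 1 if y_miss_pre is None else 0
    return y_dv, dummy
```

The indicator lets the analysis model estimate a separate intercept shift for the zero-filled cases instead of treating the filled zeros as real scores.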

Mean Value Imputation
     Missing Pretest Scores
When pretest scores are missing for a fraction of the sample, mean value imputation involves replacing the missing values of the pretest score with the mean of the non-missing values of the pretest score for students in the same group (treatment or control). The data were first divided into the two groups—the treatment group and the control group. In the treatment group, the variable YmissTreat.Pre,ij was created as the mean of all non-missing values of YmissPre,ij. Similarly, for the control group, the variable YmissControl.Pre,ij was created as the mean of all non-missing values of YmissPre,ij. Finally, the variable Y.mvPre,ij, was created as:

Y.mvPre,ij = YmissPre,ij if YmissPre,ij is non-missing
  = YmissTreat.Pre,ij if YmissPre,ij is missing and the student is in treatment group
  = YmissControl.Pre,ij if YmissPre,ij is missing and the student is in control group
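A Python sketch of the group-mean imputation just defined (record and field names are illustrative):

```python
def mean_impute(records):
    """Replace each missing pretest with the mean of the observed
    pretests in the same experimental group (treatment or control)."""
    means = {}
    for g in (0, 1):
        obs = [r["Y_miss_pre"] for r in records
               if r["Trt"] == g and r["Y_miss_pre"] is not None]
        means[g] = sum(obs) / len(obs)
    for r in records:
        v = r["Y_miss_pre"]
        r["Y_mv_pre"] = means[r["Trt"]] if v is None else v
    return records
```

Computing the means separately by group keeps the imputed values from pulling the two groups' pretest distributions toward each other.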

The analytical model used to estimate the treatment effect is similar to Model B, but where the pretest variable with missing values is replaced by Y.mvPre,ij, as shown below:

YPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(Y.mvPre,ij) + εij

     Missing Post-test Scores
When post-test scores are missing for a fraction of the sample, mean value imputation is conducted just as it was for missing pretest scores. For each group, treatment and control, we replaced the missing post-test values with the mean of the non-missing post-test scores for students in the same group—that is, separately for the treatment and control groups—to create the outcome variable Y.mvPost,ij.

For the simulations where we assumed pretest scores were available for the entire sample, the analytical model used to estimate the treatment effect is similar to Model B, but the post-test variable with missing values is replaced by Y.mvPost,ij, as shown below:

Y.mvPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(YPre,ij) + εij

For the simulations where we assumed pretest scores were not available for any sample members, the analytical model used to estimate the treatment effect is similar to Model A, but the post-test variable with missing values is replaced by Y.mvPost,ij, as shown below:

Y.mvPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + εij

Non-stochastic Regression Imputation
This method involves the replacement of missing values with predicted values from regression models. First we describe our approach to imputing values and analyzing the data when data were missing for students within schools; then we describe our approach to imputing values and analyzing the data when data are missing for entire schools.

     Missing Pretest Scores for Students Within Schools
The data were first divided into the two groups—the treatment group and the control group. For the treatment group, we fit an imputer's model with the following form:

YmissPre,ij = β0 + β1(Female_cenij) + β2(HiRisk_cenij) + β3(YPost,ij) + Σj γj(Schj) + εij

where Schj = 1 if the student is in school j, and = 0 otherwise. Note the use of school fixed effects (e.g., school dummy variables) in this imputer's model instead of the random intercept terms for schools that are used in the analytical model used to estimate the treatment effect. This approach is consistent with the recommendations in Reiter, Raghunathan, and Kinney (2006).
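The per-group imputer's regression can be fit by ordinary least squares; below is a minimal pure-Python sketch via the normal equations (the report fit these models in SAS; this stands in for any OLS routine, and for brevity the example omits the school dummies, which would simply be extra columns of X):

```python
def ols_fit(X, y):
    """Tiny OLS via the normal equations, solved by Gaussian
    elimination with partial pivoting. Each row of X starts with a 1
    for the intercept."""
    n, k = len(X), len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

def predict(beta, xrow):
    """Non-stochastic imputation uses this fitted value directly."""
    return sum(w * x for w, x in zip(beta, xrow))
```

Fitting the model separately within the treatment and control groups, as the text describes, lets every coefficient differ between groups.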

For treatment students, we obtained a predicted value of the pretest score as:

ŶTreat.Pre,ij = β̂0 + β̂1(Female_cenij) + β̂2(HiRisk_cenij) + β̂3(YPost,ij) + Σj γ̂j(Schj)

For control students, we fit the same imputer's model as was fit for treatment students:

YmissPre,ij = β0* + β1*(Female_cenij) + β2*(HiRisk_cenij) + β3*(YPost,ij) + Σj γj*(Schj) + εij

Then we used this model to produce predicted pretest scores for control students. Note that we put stars on parameters and estimates to emphasize that the model estimates for the control group are not identical to the model estimates for the treatment group:

ŶControl.Pre,ij = β̂0* + β̂1*(Female_cenij) + β̂2*(HiRisk_cenij) + β̂3*(YPost,ij) + Σj γ̂j*(Schj)

Finally, we created a new variable, Y.nriPre,ij, defined as follows:

Y.nriPre,ij = YmissPre,ij if YmissPre,ij is non-missing
  = ŶTreat.Pre,ij if YmissPre,ij is missing and the student is in the treatment group
  = ŶControl.Pre,ij if YmissPre,ij is missing and the student is in the control group

The analytical model used to estimate the treatment effect is similar to Model B, but the pretest variable with missing values is replaced by Y.nriPre,ij, as shown below:

YPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(Y.nriPre,ij) + εij

     Missing Pretest Scores for Entire Schools
When schools had missing pretest scores (i.e., every student within a school had missing pretest values), we aggregated data to the school level, then used non-stochastic regression imputation to obtain predicted values of school-level mean pretest scores, then replaced the missing, school-level mean pretest scores with the imputed values, and then conducted impact analyses using the school-level aggregate data. We describe this process in more detail below.

For each school, the following school-level means were created from the observed (nonmissing) student-level data:

YPre.j = the mean of YPre,ij over all students in school j (i.e., the school-level mean pretest score for schools with non-missing pretest scores)
YPost.j = the mean of YPost,ij over all students in school j (i.e., the school-level mean post-test score)
Female_cen.j = the mean of Female_cenij over all students in school j (i.e., the centered proportion of students in the school who are female)
HiRisk_cen.j = the mean of HiRisk_cenij over all students in school j (i.e., the centered proportion of students in the school who are high-risk)

For the treatment group, we fit an imputer's model of the following form:

YmissPre.j = β0 + β1(Female_cen.j) + β2(HiRisk_cen.j) + β3(YPost.j) + εj

Then we computed the predicted value from the regression for each school:

ŶTreat.Pre.j =β̂0 +β̂1(Female_cen.j) +β̂2(HiRisk_cen.j) +β̂3(YPost.j)

For the control group, we repeated the same steps. More specifically, we fit an imputer's model of the following form:

YmissPre.j = β0* + β1*(Female_cen.j) + β2*(HiRisk_cen.j) + β3*(YPost.j) + εj

The stars on the betas emphasize that the model estimates for the control group are not identical to the model estimates for the treatment group. For control schools, we computed the predicted value from the regression for each school:

ŶControl.Pre.j = β̂0* + β̂1*(Female_cen.j) + β̂2*(HiRisk_cen.j) + β̂3*(YPost.j)

Finally, we created a new pretest variable, Y.nriPre.j, as follows:

Y.nriPre.j = YmissPre.j if YmissPre.j is non-missing
  = ŶTreat.Pre.j if YmissPre.j is missing and the school is in the treatment group
  = ŶControl.Pre.j if YmissPre.j is missing and the school is in the control group

The analytical model used to estimate the treatment effect is different from Model B because the data have been aggregated to the school level. Therefore, we estimate a school-level analysis model, as shown below:

YPost.j = β0 + β1(Trtj) + β2(Female_cen.j) + β3(HiRisk_cen.j) + β4(Y.nriPre.j) + εj

     Missing Post-test Scores for Students Within Schools
For missing post-test scores for students within schools, we took an almost identical approach to the imputation approach described earlier for addressing missing pretest scores for students within schools. However, instead of using the post-test to impute the pretest, we used the pretest to impute the post-test. The resulting outcome measure Y.nriPost.ij equals the true value when it is observed and the imputed value when the true value is missing.

For the simulations where we assumed pretest scores were available for the entire sample, the analytical model used to estimate the treatment effect is similar to Model B, but the post-test variable with missing values is replaced by Y.nriPost.ij, as shown below:

Y.nriPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(YPre,ij) + εij

For the simulations where we assumed pretest scores were not available for any sample members, the analytical model used to estimate the treatment effect is similar to Model A, but the post-test variable with missing values is replaced by Y.nriPost.ij, as shown below:

Y.nriPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + εij

     Missing Post-test Scores for Entire Schools
For missing post-test scores for entire schools, we took an almost identical approach to the imputation approach described earlier for addressing missing pretest scores for entire schools—except that we used the school's mean pretest score to impute the school's mean post-test score, instead of the reverse.

When pretest data are available, the analytical model used to estimate the treatment effect is different from Model B because the data have been aggregated to the school level. Therefore, we estimate a school-level analysis model, as shown below:

Y.nriPost.j = β0 + β1(Trtj) + β2(Female_cen.j) + β3(HiRisk_cen.j) + β4(YPre.j) + εj

When pretest data are not available, the analytical model used to estimate the treatment effect is different from Model A—again because the data have been aggregated to the school level. Therefore, we estimate a school-level analysis model, as shown below:

Y.nriPost.j = β0 + β1(Trtj) + β2(Female_cen.j) + β3(HiRisk_cen.j) + εj

Single Stochastic Regression Imputation
     Missing Pretest Scores for Students Within Schools
When pretest data are missing for students within schools, the procedure we used for implementing single stochastic regression imputation builds on the procedures we used for implementing non-stochastic regression imputation. However, in single stochastic regression imputation, a randomly selected residual is added to the predicted value from the imputer's model. For the treatment group, we fit the same imputer's model as for non-stochastic regression imputation; generate predicted values from the model, ŶTreat.Pre,ij; use the model to generate level-1 residuals, rij ; and create a new outcome variable Y.sriPre,ij. This new outcome variable equals the true value when it is observed, and it equals ŶTreat.Pre,ij + rij when the true value is missing, where rij is a randomly selected residual. Finally, we repeat the process separately for the control group.

The analytical model used to estimate the treatment effect is similar to Model B, but the pretest variable with missing values is replaced by Y.sriPre,ij, as shown below:

YPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(Y.sriPre,ij) + εij
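The stochastic step described above can be sketched in Python for this case (imputing pretests from post-tests). The data and the single-predictor imputer's model are hypothetical simplifications of the report's full specification:

```python
import random
import statistics

random.seed(2)

# Hypothetical data for one group: post-test observed, some pretests missing.
post = [random.gauss(0, 1) for _ in range(200)]
pre = [0.7 * y + random.gauss(0, 0.5) for y in post]
pre_miss = [None if random.random() < 0.3 else x for x in pre]

# Imputer's model fit on complete cases (one predictor for brevity).
obs = [(y, x) for y, x in zip(post, pre_miss) if x is not None]
my = statistics.fmean(y for y, _ in obs)
mx = statistics.fmean(x for _, x in obs)
b1 = sum((y - my) * (x - mx) for y, x in obs) / sum((y - my) ** 2 for y, _ in obs)
b0 = mx - b1 * my

# Level-1 residuals from the complete cases.
resids = [x - (b0 + b1 * y) for y, x in obs]

# Single stochastic imputation: predicted value PLUS a randomly selected
# residual, so the imputed values keep realistic student-level noise.
pre_sri = [x if x is not None else b0 + b1 * y + random.choice(resids)
           for y, x in zip(post, pre_miss)]
```

Adding a sampled residual is what distinguishes this method from non-stochastic regression imputation: the imputed scores scatter around the regression line rather than sitting on it.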

     Missing Pretest Scores for Entire Schools
When pretest data are missing for entire schools, the procedure for implementing single stochastic regression imputation is almost the same as that described earlier for non-stochastic regression imputation, except that a randomly selected residual is added to each predicted value. For the treatment group, we create a file of school-level means; fit the same school-level imputer's model; generate predicted values from the model for each treatment school, ŶTreat.Pre.j; and use the model to generate school-level residuals, rjk. We repeat this process for the control group to generate predicted values for each control school, ŶControl.Pre.j, and school-level residuals, rjl*.

From these estimates, a new pretest variable is created as follows:

Y.sriPre.j = YmissPre.j if YmissPre.j is non-missing
  = ŶTreat.Pre.j + rjk if YmissPre.j is missing and school j is in the treatment group
  = ŶControl.Pre.j + rjl* if YmissPre.j is missing and school j is in the control group

where rjk is a randomly selected residual from the treatment group, and rjl* is a randomly selected residual from the control group.

The analytical model used to estimate the treatment effect was of the form:

YPost.j = β0 + β1(Trtj) + β2(Female_cenj) + β3(HiRisk_cenj) + β4(Y.sriPre.j) + εj

     Missing Post-test Scores for Students within Schools
When post-test data are missing for students within schools, the procedure we used for implementing single stochastic regression imputation builds on the procedures we used for implementing non-stochastic regression imputation. However, in single stochastic regression imputation, a randomly selected residual is added to the predicted value from the imputer's model. For the treatment group, we fit the same imputer's model as for non-stochastic regression imputation, which uses pretest scores to impute post-test scores; generate predicted values from the model, ŶTreat.Post,ij; use the model to generate level-1 residuals, rij; and create a new outcome variable Y.sriPost,ij. This new outcome variable equals the true value when it is observed, and it equals ŶTreat.Post,ij + rij when the true value is missing, for a randomly selected residual rij. Finally, we repeat the process separately for the control group.

For scenarios where pretest data were available, the analytical model used to estimate the treatment effect was of the form:

Y.sriPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(YPre,ij) + εij

For scenarios where pretest data were not available, the analytical model used to estimate the treatment effect was of the form:

Y.sriPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + εij

     Missing Post-test Scores for Entire Schools
An analogous imputation procedure to that described for missing pretest scores of entire schools was used. To obtain imputed values for the treatment group, we create a file of school-level means; fit a school-level imputer's model to predict post-test school means; generate predicted values from the model for each treatment school; use the model to generate residuals; add residuals to predicted values to obtain imputed values; and replace missing values with imputed values. The process is repeated for the control group.

For scenarios where pretest data were available, the analytical model used to estimate treatment impact was of the form:

Y.sriPost.j = β0 + β1(Trtj) + β2(Female_cenj) + β3(HiRisk_cenj) + β4(YPre.j) + εj

For scenarios where pretest data were not available, the analytical model used to estimate treatment impact was of the form:

Y.sriPost.j = β0 + β1(Trtj) + β2(Female_cenj) + β3(HiRisk_cenj) + εj

Multiple Stochastic Regression Imputation
Multiple stochastic regression imputation is conducted in the same manner as single stochastic regression imputation, except that we produced five imputed values for each missing value.83 Because each imputed value is created by randomly generating a residual and adding it to the imputation model's predicted value for the missing case, the five imputed values will be slightly different. Analysis of the five data sets produces five estimates of the treatment effect, which we denote as β̂11, β̂12, β̂13, β̂14, β̂15. The overall treatment effect is computed as the mean of the five estimates. The standard error is computed as a function of the standard error of each estimate and the variation in the estimates across the five replications. For more details, see Chapter 3.
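The combining rules referenced here (Rubin's rules) can be sketched in Python; the five estimates and standard errors below are made-up numbers for illustration:

```python
import statistics

# Combining rules for m = 5 imputations, as described in Chapter 3.
# The estimates and standard errors below are illustrative stand-ins.
est = [0.21, 0.18, 0.24, 0.20, 0.19]      # five treatment-effect estimates
se = [0.050, 0.052, 0.049, 0.051, 0.050]  # their estimated standard errors
m = len(est)

point = statistics.fmean(est)                    # overall treatment effect
within = statistics.fmean(s ** 2 for s in se)    # mean within-imputation variance
between = statistics.variance(est)               # variance across the 5 estimates
se_combined = (within + (1 + 1 / m) * between) ** 0.5
```

The combined standard error exceeds any single imputation's standard error because the between-imputation term captures the extra uncertainty due to the missing data.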

     Missing Pretest Scores for Students Within Schools
For our implementation of multiple stochastic regression imputation, we used SAS's PROC MI to generate the imputed values, and SAS's PROC MIANALYZE to fit the analytical model to the data sets with the imputed values. One detail of our use of PROC MI to generate the imputed values is worthy of note. There is no way to fit a two-level hierarchical linear model (HLM) in PROC MI. Therefore, to approximate the two-level HLM model in our imputation model, we used fixed effects dummy variables for schools, in place of the random intercept terms that we used in the impact analysis models.

For the treatment group, we used PROC MI to fit an imputer's model of the form:

YmissPre,ij = β0 + β1(Female_cenij) + β2(HiRisk_cenij) + β3(YPost,ij) + γ1(Sch1) + γ2(Sch2) + ... + γ29(Sch29) + εij

where Schj = 1 if the student is in school j, and = 0 otherwise. We fit a model of the same form to the data from control group members. PROC MI then generates predicted values and, rather than sampling a residual, generates a residual from a normal distribution with mean 0 and variance equal to the estimated variance of εij. The generated residual is added to the predicted value to obtain an imputed value, which we denote as ŶTreat.Pre,ij + rijk if the student is in the treatment group, and as ŶControl.Pre,ij + rijl* if the student is in the control group.

As in single stochastic regression imputation, we define

Y.mriPre,ij = YmissPre,ij if YmissPre,ij is non-missing
  = ŶTreat.Pre,ij +rijk if YmissPre,ij is missing and the student is in treatment group
  = ŶControl.Pre,ij +rijl* if YmissPre,ij is missing and the student is in control group

where rijk is a randomly generated residual for treatment group members, and rijl* is a randomly generated residual for control group members.

For each of the five data sets produced, the analytical model used to estimate the treatment effect was of the form:

YPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(Y.mriPre,ij) + εij

The estimates from the five impact models were combined, as described in Chapter 3, to obtain the overall impact estimate and its standard error.

     Missing Pretest Scores for Entire Schools
As described for non-stochastic and single stochastic regression imputation, data were aggregated to the school level to produce school-level means, and imputation and impact analyses were conducted on the school-level data sets. We used SAS's PROC MI to fit the imputer's model, and SAS's PROC MIANALYZE to fit the analytical models to estimate impacts. The analytical model to estimate the treatment effect was of the form:

YPost.j = β0 + β1(Trtj) + β2(Female_cenj) + β3(HiRisk_cenj) + β4(Y.mriPre.j) + εj

The estimates from the five impact models were combined, as described in Chapter 3, to obtain the overall impact estimate and its standard error.

     Missing Post-test Scores for Students Within Schools
An analogous imputation procedure to that described for missing pretest scores of students was used. For scenarios where pretest data were available, the analytical model used to estimate the treatment effect was of the form:

Y.mriPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(YPre,ij) + εij

For scenarios where pretest data were not available, the analytical model used was of the form:

Y.mriPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + εij

The estimates from the five impact models were combined, as described in Chapter 3, to obtain the overall impact estimate and its standard error.

     Missing Post-test Scores for Entire Schools
An analogous imputation procedure to that described for missing pretest scores for entire schools was used. As before, data were aggregated to the school level to produce school-level means, and imputation and impact analyses were conducted on the school-level data sets. We used SAS's PROC MI to fit the imputer's model, and SAS's PROC MIANALYZE to fit the analytical models to estimate impacts. For scenarios where pretest data were available, the analytical model used was of the form:

Y.mriPost.j = β0 + β1(Trtj) + β2(Female_cenj) + β3(HiRisk_cenj) + β4(YPre.j) + εj

For scenarios where pretest data were not available, the analytical model used was of the form:

Y.mriPost.j = β0 + β1(Trtj) + β2(Female_cenj) + β3(HiRisk_cenj) + εj

The estimates from the five impact models were combined, as described in Chapter 3, to obtain the overall impact estimate and its standard error.

Maximum Likelihood—EM Algorithm with Multiple Imputation
The EM algorithm with multiple imputation was implemented in a manner very similar to that described for multiple stochastic regression imputation. The difference is that in multiple stochastic regression imputation the imputed values were based on predicted values from a regression model, whereas in this approach the EM algorithm was used to obtain them. In both approaches we generated five imputed data sets, and in both a random residual was added to each predicted value so that the imputed values in each of the five data sets would differ slightly from one another. Analysis of the five data sets produced five estimates of the treatment effect, which we denote as β̂11, β̂12, β̂13, β̂14, β̂15. The overall treatment effect is computed as the mean of the five estimates. The standard error is computed as a function of the standard error of each estimate and the variation in the estimates across the five replications. For more details, see Chapter 3.

For the treatment group, we entered the following variables into the EM algorithm:
YmissPre,ij
Female _ cenij
HiRisk _ cenij
YPost,ij
Sch1, Sch2,...,Sch29

where Schj = 1 if the student is in school j, and = 0 otherwise. We separately entered data for control group members into the EM algorithm. The same variables were entered, except the school dummies corresponded to the control group schools. PROC MI used the EM algorithm to generate predicted values and added a randomly generated residual to each predicted value to obtain an imputed value. We denote the imputed value as ŶTreat.Pre,ij + rijk if the student is in the treatment group, and as ŶControl.Pre,ij + rijl* if the student is in the control group.
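As a toy illustration of the E-step/M-step alternation that underlies this method, the sketch below runs EM on a bivariate normal with one fully observed variable, in pure Python. This is a miniature of the general algorithm under made-up data, not PROC MI's actual implementation (which handles many variables at once):

```python
import random
import statistics

random.seed(3)

# Toy setting: x fully observed, y missing for ~30% of cases.
x = [random.gauss(0, 1) for _ in range(300)]
y = [xi + random.gauss(0, 0.6) for xi in x]
y_obs = [yi if random.random() > 0.3 else None for yi in y]

n = len(x)
mux = statistics.fmean(x)       # x is complete, so its MLEs are direct
sxx = statistics.pvariance(x)
muy, syy, sxy = 0.0, 1.0, 0.0   # crude starting values

for _ in range(50):
    beta = sxy / sxx
    s_y = s_yy = s_xy = 0.0
    for xi, yi in zip(x, y_obs):
        if yi is None:
            # E-step: expected sufficient statistics for a missing y
            ey = muy + beta * (xi - mux)   # E[y | x]
            vy = syy - beta * sxy          # Var[y | x]
            s_y += ey
            s_yy += ey * ey + vy
            s_xy += xi * ey
        else:
            s_y += yi
            s_yy += yi * yi
            s_xy += xi * yi
    # M-step: update the parameters from the completed statistics
    muy = s_y / n
    syy = s_yy / n - muy * muy
    sxy = s_xy / n - mux * muy
```

Once EM has converged, imputations are drawn using the estimated parameters plus random residuals, which is how the five data sets in this method are produced.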

As in multiple stochastic regression imputation, we define

Y.emmiPre,ij = YmissPre,ij if YmissPre,ij is non-missing
  = ŶTreat.Pre,ij +rijk if YmissPre,ij is missing and the student is in treatment group
  = ŶControl.Pre,ij +rijl* if YmissPre,ij is missing and the student is in control group

where rijk is a randomly generated residual for treatment group members, and rijl* is a randomly generated residual for control group members.

For each of the five data sets produced, the analytical model used to estimate the treatment effect was of the form:

YPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(Y.emmiPre,ij) + εij

The estimates from the five impact models were combined, as described in Chapter 3, to obtain the overall impact estimate and its standard error.

     Missing Pretest Scores for Entire Schools
As described for the regression imputation methods, data were aggregated to the school level to produce school-level means, and imputation and impact analyses were conducted on the school-level data sets. We used SAS's PROC MI to implement the EM algorithm, and SAS's PROC MIANALYZE to fit the analytical models to estimate impacts. The analytical model to estimate the treatment effect was of the form:

YPost.j = β0 + β1(Trtj) + β2(Female_cenj) + β3(HiRisk_cenj) + β4(Y.emmiPre.j) + εj

The estimates from the five impact models were combined, as described in Chapter 3, to obtain the overall impact estimate and its standard error.

     Missing Post-test Scores for Students Within Schools
An analogous EM imputation procedure to that described for missing pretest scores of students was used. For scenarios where pretest data were available, the analytical model used to estimate the treatment effect was of the form:

Y.emmiPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(YPre,ij) + εij

For scenarios where pretest data were not available, the analytical model used was of the form:

Y.emmiPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + εij

The estimates from the five impact models were combined, as described in Chapter 3, to obtain the overall impact estimate and its standard error.

     Missing Post-test Scores for Entire Schools
An analogous EM imputation procedure to that described for missing pretest scores for entire schools was used. As before, data were aggregated to the school level to produce school-level means, and imputation and impact analyses were conducted on the school-level data sets. We used SAS's PROC MI to fit the imputer's model, and SAS's PROC MIANALYZE to fit the analytical models to estimate impacts. For scenarios where pretest data were available, the analytical model used was of the form:

Y.emmiPost.j = β0 + β1(Trtj) + β2(Female_cenj) + β3(HiRisk_cenj) + β4(YPre.j) + εj

For scenarios where pretest data were not available, the analytical model used was of the form:

Y.emmiPost.j = β0 + β1(Trtj) + β2(Female_cenj) + β3(HiRisk_cenj) + εj

The estimates from the five impact models were combined, as described in Chapter 3, to obtain the overall impact estimate and its standard error.

Fully-Specified Regression Models with Treatment/Covariate Interactions
     Missing Post-test Scores for Students or Entire Schools
To implement this approach, we calculated the sample centered value of the pretest score by subtracting the sample mean of the pretest from the pretest score for each student:

Ȳ*Pre = the mean of Y*Pre,ij over all students in the sample

YsampCenPre,ij = Y*Pre,ij − Ȳ*Pre. The mean of YsampCenPre,ij is zero.

The analysis model is of the form:84

YmissPost,ij = β0 + α0j + β1(Trtj) + β2(Female_cenij) + β3(HiRisk_cenij) + β4(YsampCenPre,ij) + β5(Trtj × YsampCenPre,ij) + εij

and β̂1 is the estimate of the average treatment effect.85
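To see why β̂1 estimates the average treatment effect, average the model over the sample and difference by treatment status; because the sample-centered pretest (and the grand-mean centered covariates) have mean zero, the interaction term drops out of the average difference:

```latex
\mathrm{E}\!\left[Y_{ij}\mid Trt_j = 1\right] - \mathrm{E}\!\left[Y_{ij}\mid Trt_j = 0\right]
  \;=\; \beta_1 \;+\; \beta_5\,\overline{Y}_{sampCenPre}
  \;=\; \beta_1,
\qquad \text{since } \overline{Y}_{sampCenPre} = 0.
```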

Simple Weighting Approach
We use weighting to deal with missing post-test data only. Simple weighting can be used only when data are missing for students within schools, since it uses the non-missing cases in a school to represent the missing cases. With data missing for entire schools, there are no non-missing cases to use.

     Missing Post-test Scores for Students
For this method, in each school, respondents are simply weighted up to the total number of students sampled from the school.

Let Nj be the number of students sampled in the jth school, and let nj be the number of respondents in the jth school (i.e., the number of students with non-missing post-test scores). Within each school, each student with a non-missing post-test score is assigned a weight equal to wij = Nj/nj; each student with a missing post-test score is assigned a weight of 0. Thus, for each school, the sum of the student weights equals the number of students selected in the sample from that school (Nj). For example, if 60 students were sampled in school j and 40 students had non-missing post-test scores, each of the 40 students would be assigned a weight equal to 60/40. The sum of the weights over the 40 respondents equals the size of the original sample.
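In code, the weight construction for the example school is trivial (a Python sketch using the numbers from the example above):

```python
# Weight construction for the example school: 60 students sampled,
# 40 with non-missing post-test scores.
N_j = 60                 # students sampled in school j
n_j = 40                 # respondents (non-missing post-test)
w_ij = N_j / n_j         # weight assigned to each respondent (= 1.5)

# Respondents carry w_ij; nonrespondents carry 0, so the weights
# sum back to the number of students sampled in the school.
weights = [w_ij] * n_j + [0.0] * (N_j - n_j)
```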

Using the WEIGHT statement, the following models were fit to the data using SAS PROC MIXED.86 For scenarios where pretest data were available, a weighted version of Model B was used to estimate the treatment effect, where the weight was set to wij (as defined above). For scenarios where pretest data were not available, a weighted version of Model A was used to estimate the treatment effect, where the weight was set to wij.

More Sophisticated Weighting Approach
     Missing Post-test Scores for Students or Entire Schools
Like the simple weighting approach described above, in this method we created weights for each respondent, then fit the same models as specified above to the complete cases, but applied the weights to the data using the weight statement in SAS PROC MIXED. The procedure for calculating the weights under the more sophisticated approach was as follows:

  1. Estimate a logit model of response as a function of (1) dummy variables for 59 of the schools, (2) the female and high-risk covariates, and (3) the pretest.
  2. Use the model to compute estimated response probabilities for each student.
  3. Divide the entire sample—including both respondents and nonrespondents—into quintiles based on the estimated response probability.
  4. Compute the response rate (between 0 and 1) for each quintile.
  5. Set the weight wij for each student to the inverse of the response rate for all students in the same quintile. This effectively creates five different weights, one for all students in each quintile: w1, w2, w3, w4, and w5.
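Steps 3 through 5 can be sketched in Python. The response probabilities below are randomly generated stand-ins for the logit model's fitted values from steps 1 and 2:

```python
import random
import statistics

random.seed(4)

# Stand-ins for step 2's output: an estimated response probability for each
# of 1,000 hypothetical students, plus whether each actually responded.
n = 1000
p_hat = [random.uniform(0.3, 0.95) for _ in range(n)]
responded = [random.random() < p for p in p_hat]

# Step 3: split the full sample (respondents AND nonrespondents) into
# quintiles of the estimated response probability.
order = sorted(range(n), key=lambda i: p_hat[i])
quintile = [0] * n
for rank, i in enumerate(order):
    quintile[i] = rank * 5 // n          # 0..4, 200 students per quintile

# Step 4: observed response rate within each quintile.
rates = [statistics.fmean(1.0 if responded[i] else 0.0
                          for i in range(n) if quintile[i] == q)
         for q in range(5)]

# Step 5: respondents are weighted by the inverse of their quintile's
# response rate; nonrespondents get weight 0.
w = [1.0 / rates[quintile[i]] if responded[i] else 0.0 for i in range(n)]
```

By construction, the weights within each quintile sum to that quintile's full sample size, so the weighted respondents stand in for the entire sample.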

For scenarios where pretest data were available, a weighted version of Model B was used to estimate the treatment effect, where the weight was set to wij. For scenarios where pretest data were not available, a weighted version of Model A was used to estimate the treatment effect, where the weight was set to wij.


80 The constant C below was included as a multiplier in this equation to ensure that the unconditional variance of post-test scores would equal 1.
81 For a special analysis, we also generated missing data under three additional missing data rates—10 percent, 20 percent, and 30 percent, described subsequently in the section on Scenario III.
82 We conducted a set of simulations where the imputer's model included school random intercepts instead of school fixed effects. From this exercise, we found that the models with school fixed effects yielded more accurate standard error estimates than the models with school random effects. Therefore, in Section 4 and in Appendix E, we present results from the models that included school fixed effects.
83 The literature suggests that 5-10 imputations is adequate (see Rubin 1987, 1996 and Little & Rubin, 2002).
84 This method assumes the pretest is available, so only one model is specified.
85 Ordinarily, the fully specified regression model would have interaction terms between the treatment dummy variable and all of the baseline covariates in the model, not just some of them as shown here. Interactions of the treatment dummy and the female and risk covariates were not entered into the model here because in our synthetic data the impact does not vary with these factors.
86 In estimating the standard errors, we did not account either for the variation in the weights across sample members or for the sampling variability in the model estimates used to compute the weights. In principle, failure to account for these sources of variation should lead us to underestimate the standard error of the treatment effect. However, our simulation results suggest that the size of the bias in the estimated standard errors is very small (see the third figure in each exhibit in Chapter 4). Therefore, while these corrections may be generally advisable with weighted data, we concluded that they were unnecessary for these simulations.