
In the section titled “Multiple Stochastic Regression Imputation,” we provided some guidance on how to use multiple imputation to address missing data. Before implementing MI, or any other method to address missing data, we would recommend additional reading, such as Allison (2002) and articles by the statisticians who have developed and refined MI methods (e.g., Rubin, 1996; Schafer, 1999). However, in the end, researchers need to know how to use available software to implement MI should they choose that option for dealing with missing data. Therefore, we provide some guidance and references to other resources that may be helpful.

As shown earlier in this report, neither specialized software nor MI-specific procedures in general-purpose statistical software are required to use MI methods. However, programming one's own multiple imputation algorithm is considerably more challenging than the programming required to specify analysis models in most evaluations. Therefore, specialized MI software may be useful for people who expect to conduct MI regularly. Furthermore, MI-specific procedures in the software that education researchers commonly use can make MI an easier choice in education-related RCTs.

In this section we list some specialized software packages for conducting MI, and
we also list some MI-specific procedures in general purpose statistical software
that may make MI easier for users to implement. For a comprehensive treatment of
the software packages available to implement MI, see Horton & Kleinman (2007). We conclude with a more extensive example
of how to conduct MI in SAS for purposes of illustration. We have selected SAS for
this example—without recommending it over other alternatives—because it is a commonly
used general-purpose statistical package, and because it can handle the imputation,
estimation, and combination steps all in a single package.

**Software for Multiple Imputation**

Specialized, stand-alone software has been developed for implementing MI. Some examples include:

- **IVEware.** Developed by T. E. Raghunathan, Peter W. Solenberger, and John Van Hoewyk at the University of Michigan. It is available for download at www.isr.umich.edu/src/smp/ive/.
- **Amelia II.** Developed by James Honaker, Gary King, and Matthew Blackwell at Harvard University. It is available for download at http://gking.harvard.edu/amelia/.
- **SOLAS.** SOLAS is a commercial package that can be purchased at http://www.statsol.ie/html/solas/solas_home.html.

Some statistical packages commonly used in education research also have MI procedures, modules, or options, while others do not. Some of the software packages used by education researchers include:

- **Stata.** A multiple imputation procedure developed by Patrick Royston can be installed directly through Stata.
- **SPSS.** SPSS Inc. offers an add-on package named PASW Missing Values that will implement MI. The SPSS base package does not include canned routines for conducting MI.
- **HLM.** HLM can be used to analyze multiple data sets and can aggregate the results in an MI framework, provided that the multiple data sets are created by the user beforehand. See http://www.ssicentral.com/hlm/example6-1.html
- **SPlus.** There are several SPlus libraries available that contain functions for multiple imputation. These include:
  - Missing Data Library, built into SPlus 6.0 and higher
  - Hmisc Library. For more information see http://www.multiple-imputation.com/
  - MICE. For more information see http://www.multiple-imputation.com/
  - NORM, CAT, MIX, and PAN. Developed by Joe Schafer at Penn State University. They are available for download at http://www.stat.psu.edu/~jls/misoftwa.html#top
- **R.** Most of the SPlus libraries listed above are also available for R. For more information, see http://cran.r-project.org/web/views/SocialSciences.html
- **SAS.** Specific SAS procedures have been developed to facilitate MI. See the example below.

**An Example of MI Using SAS**

SAS includes procedures that allow the user to (1) generate *k* imputed values
for each missing value in the data, which yields *k* different data sets;
(2) estimate impacts for each imputed data set using one's preferred regression
procedure (e.g., PROC MIXED for mixed, hierarchical, or multi-level modeling); and
(3) combine the estimates across imputations. The last step will produce estimates
of the coefficients in the model, including the treatment effect, and estimates
of their standard errors.
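The combination in step (3) follows the standard combining rules for multiple imputation (often called Rubin's rules). As a brief sketch, using notation introduced here for illustration rather than taken from the report, let $\hat{\beta}_j$ be the estimated treatment effect from imputed data set $j$ and $\widehat{SE}_j$ its standard error, for $j = 1, \dots, k$:

$$
\bar{\beta} = \frac{1}{k}\sum_{j=1}^{k}\hat{\beta}_j, \qquad
W = \frac{1}{k}\sum_{j=1}^{k}\widehat{SE}_j^{\,2}, \qquad
B = \frac{1}{k-1}\sum_{j=1}^{k}\left(\hat{\beta}_j - \bar{\beta}\right)^2,
$$

$$
SE\left(\bar{\beta}\right) = \sqrt{\,W + \left(1 + \frac{1}{k}\right)B\,}.
$$

Here $W$ is the average within-imputation variance and $B$ is the between-imputation variance. With $k = 5$, the between-imputation variance is inflated by a factor of $1 + 1/5 = 1.2$ before it is added to $W$, so the combined standard error reflects the extra uncertainty due to the missing data.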

Suppose we are conducting an RCT of an educational intervention in which 60 schools are randomly assigned, 30 to treatment (T=1) and 30 to control (T=0). Furthermore, suppose that we want to estimate the average impacts of the intervention on three student outcomes, Y1, Y2, and Y3, controlling for four student-level background variables, X1, X2, X3, and X4, and two school-level descriptive variables, S1 and S2. The sample includes 1,000 students, but data for some students and some variables are missing. Suppose we plan to estimate impacts using a two-level model, where level 1 is the student-level model and level 2 is the school-level model. PROC MI cannot explicitly fit a two-level “imputer's model,” but we can approximate the two-level structure by adding 59 dummy variables, corresponding to the 60 schools (less 1), to the imputer's model. Let us represent those dummy variables as D1, D2, …, D59. We cannot simultaneously enter the school-level variables S1 and S2 and the 59 dummy variables, because S1 and S2 are constant within schools and are therefore collinear with the dummies. For this reason S1 and S2 are left out of the imputer's model, and their effects are captured by the dummy variables.
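The school dummy variables described above could be created in a DATA step before running the imputation. The following is only a minimal sketch: the input data set name *data0* and the numeric school identifier *school_num* (assumed to be coded 1 through 60) are hypothetical names that do not appear in the example.

```sas
/* Assumptions: data0 is a hypothetical student-level input file, and
   school_num is a hypothetical numeric school identifier coded 1-60. */
data data1;
    set data0;
    array D{59} D1-D59;            /* one indicator per school; school 60 is the omitted category */
    do j = 1 to 59;
        D{j} = (school_num = j);   /* 1 if the student attends school j, 0 otherwise */
    end;
    drop j;                        /* the loop counter is not needed in the output */
run;
```

If the school identifier is a character variable or is not coded consecutively, it would first need to be recoded to consecutive integers for this approach to work.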

In this context, MI can be used to address missing data in three steps:

**Step 1 – Create Imputed Data**

    proc mi data=data1 noprint out=data2 seed=37851 nimpute=5;
        var T Y1 Y2 Y3 X1 X2 X3 X4 D1-D59;
    run;

One can use any number for the value of “seed.” If we omit the seed value, SAS will generate a random number for use as the seed. By explicitly specifying a seed value, as shown above, we can replicate our results if we re-run the same program at a later time. The particular value chosen does not matter: it only sets the starting point for the random number generator. Different seeds produce different imputed values, but the combined results should be substantively similar.

This procedure reads the input data set *data1* and creates an output data
set *data2* with 5 observations for every observation in *data1*.
*Data2* contains a variable *_Imputation_* that equals 1, 2, 3, 4,
or 5. Non-missing values for each variable are repeated across imputations; missing
values are replaced with imputations based on a model that uses all of the variables
in the *var* statement above.
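Before estimating impacts, it may be worth confirming that the imputed file has the structure just described. The sketch below is one possible check and is not part of the original example. It uses only the data set and variable names defined above.

```sas
/* Count the records in each imputation; with 1,000 students and
   NIMPUTE=5 we would expect 1,000 records for each value of _Imputation_. */
proc freq data=data2;
    tables _Imputation_;
run;

/* Confirm that the variables in the imputer's model no longer have
   missing values, and compare their means across imputations. */
proc means data=data2 n nmiss mean;
    class _Imputation_;
    var Y1 Y2 Y3 X1 X2 X3 X4;
run;
```

Large differences in the means across imputations would suggest that the imputed values are highly variable and that the between-imputation component of the final standard errors will be correspondingly large.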

**Step 2 – Estimate the Model (e.g., Y1 only)**

    proc mixed data=data2;
        class school; /* school is a variable that uniquely identifies each school */
        model Y1 = T X1 X2 X3 S1 S2 / solution covb; /* SOLUTION and COVB are needed to produce the SolutionF and CovB tables */
        by _Imputation_;
        random intercept / type=un sub=school;
        ods output SolutionF=data3a CovB=data3b;
    run;

For each of the five imputed data sets, this procedure specifies a linear, multi-level
model to estimate the average treatment effect on the first outcome variable (*Y1*).
The *random* statement allows the intercept to vary randomly across schools.

**Step 3 – Combine the Estimates**

    proc mianalyze parms=data3a covb(effectvar=rowcol)=data3b edf=994; /* 994 = 1000 students - 6 X variables */
        modeleffects T X1 X2 X3 S1 S2;
    run;

This procedure combines the five sets of estimates. The output will include an estimate of the average treatment effect (coefficient on T) and its standard error.
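Steps 2 and 3 above cover only the first outcome. One way to repeat them for all three outcomes is to wrap both steps in a macro, as in the sketch below. The macro name *mi_impact* and the output data set names are hypothetical. The sketch also assumes the SOLUTION and COVB model options, the EFFECTVAR=ROWCOL option, and the MODELEFFECTS statement, which follow the pattern in current SAS documentation for reading PROC MIXED results into PROC MIANALYZE; older SAS releases may require different syntax.

```sas
/* Hypothetical macro: estimates the two-level model for one outcome on
   each imputed data set, then combines the estimates across imputations. */
%macro mi_impact(outcome);
    proc mixed data=data2;
        class school;
        model &outcome = T X1 X2 X3 S1 S2 / solution covb;
        by _Imputation_;
        random intercept / sub=school;
        /* separate output data sets are saved for each outcome */
        ods output SolutionF=parms_&outcome CovB=covb_&outcome;
    run;

    proc mianalyze parms=parms_&outcome
                   covb(effectvar=rowcol)=covb_&outcome
                   edf=994;
        modeleffects T X1 X2 X3 S1 S2;
    run;
%mend mi_impact;

%mi_impact(Y1);
%mi_impact(Y2);
%mi_impact(Y3);
```

Because all three outcomes were included in the imputer's model in Step 1, the same imputed file *data2* can be reused for each outcome without re-running PROC MI.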