Technical Methods Report: What to Do When Data Are Missing in Group Randomized Controlled Trials

NCEE 2009-0049
October 2009

Appendix B: Resources for Using Multiple Imputation

In the section titled “Multiple Stochastic Regression Imputation,” we provided some guidance on how to use multiple imputation to address missing data. Before implementing MI, or any other method to address missing data, we would recommend additional reading, such as Allison (2002) and articles by the statisticians who have developed and refined MI methods (e.g., Rubin, 1996; Schafer, 1999). However, in the end, researchers need to know how to use available software to implement MI should they choose that option for dealing with missing data. Therefore, we provide some guidance and references to other resources that may be helpful.

As shown earlier in this report, neither specialized software nor MI-specific procedures in general purpose statistical software are required to use MI methods. However, programming one's own multiple imputation algorithm is considerably more challenging than the programming required to specify analysis models in most evaluations. Therefore, specialized MI software may be useful for people who expect to conduct MI regularly. Furthermore, MI-specific procedures in the software that education researchers commonly use can make MI an easier choice in education-related RCTs.

In this section we list some specialized software packages for conducting MI, and we also list some MI-specific procedures in general purpose statistical software that may make MI easier for users to implement. For a comprehensive treatment of the software packages available to implement MI, see Horton & Kleinman (2007; see footnote 79). We conclude with a more extensive example of how to conduct MI in SAS for purposes of illustration. We have selected SAS for this example, without recommending it over other alternatives, because it is a commonly used general-purpose statistical package and because it can handle the imputation, estimation, and combination steps all in a single package.

Software for Multiple Imputation
Specialized, stand-alone software has been developed for implementing MI. Some examples include:

Some statistical packages commonly used in education research also have MI procedures, modules, or options, while others do not. Some of the software packages used by education researchers include:

  • Stata. A multiple imputation procedure developed by Patrick Royston can be installed directly through Stata.
  • SPSS. SPSS Inc offers an add-on package named PASW Missing Values that will implement MI. The SPSS base package does not include canned routines for conducting MI.
  • HLM. HLM can be used to analyze multiple data sets and can aggregate the results in an MI framework, provided that the multiple data sets are created by the user beforehand. http://www.ssicentral.com/hlm/example6-1.html
  • SPlus. There are several SPlus libraries available that contain functions for multiple imputation. These include:
  • R. Most of the SPlus libraries listed above are also available for R. For more information, see http://cran.r-project.org/web/views/SocialSciences.html
  • SAS. Specific SAS procedures have been developed to facilitate MI. See the example below.

An Example of MI Using SAS
SAS includes procedures that allow the user to (1) generate k multiple imputed values for each missing value in the data—which yields k different data sets—(2) estimate impacts for each imputed data set using one's preferred regression procedure (e.g., PROC MIXED for mixed, hierarchical, or multi-level modeling), and (3) combine the estimates across imputations. The last step will produce estimates of the coefficients in the model, including the treatment effect, and estimates of their standard errors.

Suppose we are conducting an RCT of an educational intervention, and 60 schools are randomly assigned: 30 to treatment (T=1) and 30 to control (T=0). Furthermore, suppose that we want to estimate the average impacts of the intervention on three student outcomes, Y1, Y2, and Y3, controlling for four student-level background variables, X1, X2, X3, and X4, and two school-level descriptive variables, S1 and S2. The sample includes 1,000 students, but data for some students and some variables are missing. Suppose we plan to estimate impacts using a two-level model, where level 1 is the student-level model and level 2 is the school-level model. PROC MI cannot explicitly fit a two-level “imputer's model,” but we can approximate the two-level structure by adding 59 dummy variables, one for each of the 60 schools less one, to the imputer's model. Let us represent those dummy variables as D1, D2, …, D59. We cannot simultaneously enter the school-level variables S1 and S2 and the 59 dummy variables, so S1 and S2 will not be used in the imputer's model, but their effects will be captured by the dummy variables.
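The school dummy variables described above could be built in a DATA step before running the imputation. The following is only a sketch under stated assumptions: the raw input data set name data0 is hypothetical, and the school identifier is assumed to be a numeric variable named school coded 1 through 60 (school 60 serves as the omitted reference category).

```sas
/* Sketch only: data0 (hypothetical) is the raw student-level file;      */
/* school is assumed to be numeric and coded 1, 2, ..., 60.              */
data data1;
    set data0;
    array D{59} D1-D59;          /* one indicator per school, less one   */
    do j = 1 to 59;
        D{j} = (school = j);     /* 1 if the student is in school j      */
    end;
    drop j;                      /* loop counter is not needed afterward */
run;
```

If the school identifier is a character variable or is not coded consecutively, it would first need to be mapped to consecutive integers for this approach to work as written.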

In this context, MI can be used to address missing data in three steps:

Step 1 – Create Imputed Data

proc mi data=data1 noprint out=data2 seed=37851 NIMPUTE=5;
          var     T Y1 Y2 Y3 X1 X2 X3 X4 D1 - D59;
run;

One can use any number for the value of “seed.” If we omit the seed value, SAS will generate a random number for use as the seed value. By explicitly specifying a seed value, as shown above, we can replicate our results if we re-run the same program at a later time. The particular seed value does not matter: it only sets the starting point of the random number generator, and different seeds should lead to similar, though not identical, results.

This procedure reads the input data set data1 and creates an output data set data2 with 5 observations for every observation in data1. Data2 contains a variable _Imputation_ that equals 1, 2, 3, 4, or 5. Non-missing values for each variable are repeated across imputations; missing values are replaced with imputations based on a model that uses all of the variables in the var statement above.
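Before moving on, it may be worth confirming that the output data set has the structure just described. The sketch below is one possible check, not part of the original example: it summarizes the student-level variables separately for each value of _Imputation_, so that the number of missing values (which should now be zero) and the means can be compared across the five imputed data sets.

```sas
/* Sketch of a sanity check on the imputed data (not in the original):   */
/* N, number missing, mean, and standard deviation within each           */
/* imputed data set.                                                      */
proc means data=data2 n nmiss mean std;
    by _Imputation_;             /* data2 is already grouped this way     */
    var Y1 Y2 Y3 X1 X2 X3 X4;
run;
```

Means that differ greatly across imputations for a given variable would suggest that a large share of that variable was imputed or that the imputer's model is unstable.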

Step 2 – Estimate the Model (e.g., Y1 only)

proc mixed data=data2;
          class school; /* school is a variable that uniquely identifies each school */
          model Y1 = T X1 X2 X3 S1 S2 / solution covb; /* solution and covb are needed to produce the SolutionF and CovB tables */
          by _Imputation_;
          random intercept / type=un sub=school;
          ods output SolutionF=data3a CovB=data3b;
run;

For each of the five imputed data sets, this procedure specifies a linear, multi-level model to estimate the average treatment effect on the first outcome variable (Y1). The random statement allows the intercept to vary randomly across schools.
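Written out, the model fitted in Step 2 can be sketched as follows. The notation is ours rather than the report's: i indexes students, j indexes schools, and the Greek symbols are introduced only for illustration.

```latex
Y1_{ij} = \beta_0 + \beta_1 T_j + \beta_2 X1_{ij} + \beta_3 X2_{ij} + \beta_4 X3_{ij}
          + \beta_5 S1_j + \beta_6 S2_j + u_j + e_{ij},
\qquad u_j \sim N(0, \tau^2), \quad e_{ij} \sim N(0, \sigma^2)
```

Here \beta_1 is the average treatment effect, u_j is the random school intercept specified by the random statement, and e_{ij} is the student-level residual.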

Step 3 – Combine the Estimates

proc mianalyze parms=data3a covb(effectvar=rowcol)=data3b edf=994; /* 994 = 1000 students - 6 covariates (T, X1-X3, S1, S2) */
          modeleffects T X1 X2 X3 S1 S2;
run;

This procedure combines the five sets of estimates. The output will include an estimate of the average treatment effect (coefficient on T) and its standard error.
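For readers who want to see what the combination step computes, the standard combining rules (Rubin's rules) for a single coefficient can be sketched as follows. The symbols are ours, not the report's: m is the number of imputations (5 here), \hat{Q}_i is the coefficient estimate from imputed data set i, and U_i is its estimated variance (the squared standard error).

```latex
\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i                      % combined point estimate
\bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i                            % within-imputation variance
B       = \frac{1}{m-1} \sum_{i=1}^{m} \left( \hat{Q}_i - \bar{Q} \right)^2   % between-imputation variance
T       = \bar{U} + \left( 1 + \frac{1}{m} \right) B                % total variance
\mathrm{SE}(\bar{Q}) = \sqrt{T}
```

The between-imputation term B is what carries the extra uncertainty due to missing data into the final standard error; analyzing a single imputed data set would omit it.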


79 This paper is available online at http://maven.smith.edu/~nhorton/muchado.pdf. The appendix showing code and output is available online at http://www.math.smith.edu/muchado-appendix.pdf.