In the section titled “Multiple Stochastic Regression Imputation,” we provided some guidance on how to use multiple imputation to address missing data. Before implementing MI, or any other method to address missing data, we would recommend additional reading, such as Allison (2002) and articles by the statisticians who have developed and refined MI methods (e.g., Rubin, 1996; Schafer, 1999). However, in the end, researchers need to know how to use available software to implement MI should they choose that option for dealing with missing data. Therefore, we provide some guidance and references to other resources that may be helpful.
As shown earlier in this report, specialized software or MI-specific procedures in general purpose statistical software are not required to use MI methods. However, programming one's own multiple imputation algorithm is considerably more challenging than the programming required to specify analysis models in most evaluations. Therefore, specialized MI software may be useful for people who expect to conduct MI regularly. Furthermore, MI-specific procedures in the software that education researchers commonly use can make MI an easier choice in education-related RCTs.
In this section we list some specialized software packages for conducting MI, and we also list some MI-specific procedures in general purpose statistical software that may make MI easier for users to implement. For a comprehensive treatment of the software packages available to implement MI, see Horton & Kleinman (2007). We conclude with a more extensive example of how to conduct MI in SAS for purposes of illustration. We have selected SAS for this example, without recommending it over the alternatives, because it is a commonly used general-purpose statistical package and because it can handle the imputation, estimation, and combination steps all in a single package.
Software for Multiple Imputation
Specialized, stand-alone software has been developed for implementing MI. Some examples include:
Some statistical packages commonly used in education research also have MI procedures, modules, or options, while others do not. Some of the software packages used by education researchers include:
An Example of MI Using SAS
SAS includes procedures that allow the user to (1) generate k imputed values for each missing value in the data, which yields k different data sets; (2) estimate impacts for each imputed data set using one's preferred regression procedure (e.g., PROC MIXED for mixed, hierarchical, or multi-level modeling); and (3) combine the estimates across imputations. The last step produces estimates of the coefficients in the model, including the treatment effect, and estimates of their standard errors.
Suppose we are conducting an RCT of an educational intervention, and 60 schools are randomly assigned, 30 to treatment (T=1) and 30 to control (T=0). Furthermore, suppose that we want to estimate the average impacts of the intervention on three student outcomes, Y1, Y2, and Y3, controlling for four student-level background variables, X1, X2, X3, and X4, and two school-level descriptive variables, S1 and S2. The sample includes 1,000 students, but data for some students and some variables are missing. Suppose we plan to estimate impacts using a two-level model, where level 1 is the student-level model and level 2 is the school-level model. PROC MI does not have the capability to explicitly fit a two-level “imputer's model,” but we can approximate the two-level structure by adding 59 dummy variables, corresponding to the 60 schools (less 1), to the imputer's model. Let us represent those dummy variables as D1, D2, …, D59. We cannot simultaneously enter the school-level variables S1 and S2 and the 59 dummy variables, because S1 and S2 are constant within each school and are therefore perfectly collinear with the dummies. The variables S1 and S2 will therefore not be used in the imputer's model, but their effects will be captured by the dummy variables.
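The school dummy variables described above can be built in many ways. The following is a minimal Python sketch of the coding scheme only, not part of the SAS workflow in this report. The function name `school_dummies`, the invented school IDs, and the choice of the last school as the omitted reference category are all assumptions made for illustration.

```python
def school_dummies(school_ids):
    """Return one dict of 0/1 indicators per student, omitting one school."""
    schools = sorted(set(school_ids))
    # Assumption: the last school in sorted order is the omitted reference.
    names = {school: f"D{i}" for i, school in enumerate(schools[:-1], start=1)}
    return [
        {name: int(sid == school) for school, name in names.items()}
        for sid in school_ids
    ]


# Hypothetical data: 1,000 students spread over 60 school IDs (1 to 60).
school_ids = [1 + (i % 60) for i in range(1000)]
dummies = school_dummies(school_ids)
```

Each student has at most one dummy equal to 1, and students in the omitted school have all 59 dummies equal to 0.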
In this context, MI can be used to address missing data in three steps:
Step 1 – Create Imputed Data
proc mi data=data1 out=data2 seed=37851 nimpute=5 noprint;
   var T Y1 Y2 Y3 X1 X2 X3 X4 D1-D59;   /* variables used in the imputer's model */
run;
One can use any number for the value of “seed.” If we omit the seed value, SAS will generate a seed value itself. By explicitly specifying a seed value, as shown above, we can replicate our results if we re-run the same program at a later time. The particular value chosen is not important: it only initializes the random number generator. Different seeds will produce somewhat different imputed values, but the combined results should be similar.
This procedure reads the input data set data1 and creates an output data set data2 with 5 observations for every observation in data1. Data2 contains a variable _Imputation_ that equals 1, 2, 3, 4, or 5. Non-missing values for each variable are repeated across imputations; missing values are replaced with imputations based on a model that uses all of the variables in the var statement above.
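To make the stacked layout and the role of the seed concrete, here is a toy Python sketch. It is not PROC MI's algorithm: as a deliberately simplified assumption, each missing value is replaced by the observed mean of that variable plus random noise. The function name `stack_imputations` and the four-row data set are hypothetical.

```python
import random


def stack_imputations(rows, n_imputations, seed):
    """Return n_imputations stacked copies of rows with missing values filled.

    Toy stand-in for an imputation model: each missing value (None) becomes
    the observed mean of that variable plus normal noise. Only the layout of
    the output and the effect of the seed are meant to mirror the text.
    """
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable
    names = list(rows[0].keys())
    stats = {}
    for name in names:
        observed = [r[name] for r in rows if r[name] is not None]
        mean = sum(observed) / len(observed)
        var = sum((v - mean) ** 2 for v in observed) / max(len(observed) - 1, 1)
        stats[name] = (mean, var ** 0.5)
    stacked = []
    for m in range(1, n_imputations + 1):
        for r in rows:
            filled = {"_Imputation_": m}
            for name in names:
                if r[name] is None:
                    mean, sd = stats[name]
                    filled[name] = rng.gauss(mean, sd)  # new draw per imputation
                else:
                    filled[name] = r[name]  # observed values are repeated
            stacked.append(filled)
    return stacked


# Hypothetical input with two missing values (None).
data1 = [
    {"Y1": 52.0, "X1": 1.0},
    {"Y1": None, "X1": 0.0},
    {"Y1": 47.5, "X1": None},
    {"Y1": 60.0, "X1": 1.0},
]
data2 = stack_imputations(data1, n_imputations=5, seed=37851)
```

With four input rows and five imputations, `data2` has 20 rows. Calling the function again with the same seed returns identical values, while a different seed gives different draws for the missing entries only.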
Step 2 – Estimate the Model (e.g., Y1 only)
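proc mixed data=data2;
   class school;   /* school is a variable that uniquely identifies each school */
   model Y1 = T X1 X2 X3 S1 S2 / solution covb;   /* solution and covb are needed to produce the ODS tables saved below */
   by _Imputation_;   /* fit the model separately for each imputed data set */
   random intercept / type=un subject=school;
   ods output SolutionF=data3a CovB=data3b;   /* fixed-effect estimates and their covariance matrix */
run;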
For each of the five imputed data sets, this procedure specifies a linear, multi-level model to estimate the average treatment effect on the first outcome variable (Y1). The random option allows the intercept to vary randomly across schools.
Step 3 – Combine the Estimates
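proc mianalyze parms=data3a covb=data3b edf=994;   /* 994 = 1,000 students - 6 covariates (T, X1-X3, S1, S2) */
   /* Syntax of older SAS releases; SAS 9 and later document MODELEFFECTS in place of VAR, */
   /* and covb(effectvar=rowcol)=data3b for covariance tables produced by PROC MIXED.      */
   var T X1 X2 X3 S1 S2;
run;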
This procedure combines the five sets of estimates. The output will include an estimate of the average treatment effect (coefficient on T) and its standard error.
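The combination step follows Rubin's rules: the combined estimate is the average of the per-imputation estimates, and its variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The Python sketch below illustrates only that arithmetic. The function name `combine_estimates` and the five estimates and standard errors are hypothetical values, not results from this report, and the sketch omits the degrees-of-freedom adjustment that the edf option controls.

```python
from statistics import mean, variance


def combine_estimates(estimates, std_errors):
    """Combine m point estimates and standard errors using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                       # combined point estimate
    within = mean([se ** 2 for se in std_errors])  # average squared standard error
    between = variance(estimates)                 # sample variance across imputations
    total = within + (1 + 1 / m) * between        # total variance
    return q_bar, total ** 0.5


# Hypothetical treatment-effect estimates from five imputed data sets.
effects = [0.20, 0.25, 0.22, 0.18, 0.25]
errors = [0.10, 0.10, 0.10, 0.10, 0.10]
combined_effect, combined_se = combine_estimates(effects, errors)
```

Because the estimates differ across imputations, the combined standard error is larger than the 0.10 reported within each imputed data set, which is how MI reflects the uncertainty due to missing data.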