Technical Methods Report: The Estimation of Average Treatment Effects for Clustered RCTs of Education Interventions

NCEE 2009-0061
August 2009

Chapter 6: Empirical Analysis

This chapter presents ATE estimates and their standard errors using five published large-scale RCTs that were funded by the Institute of Education Sciences (IES) at the U.S. Department of Education (ED) and several foundations. These RCTs tested the effects of a wide range of education interventions, including mentoring programs for new teachers (Glazerman et al. 2008), early elementary school math curricula (Agodini et al. 2009), the use of selected computer software in the classroom (Dynarski et al. 2007), selected reading comprehension interventions (James-Burdumy et al. 2009), and Teach for America (Decker et al. 2004). Across the RCTs, random assignment was conducted at either the school or teacher (classroom) level primarily in low-performing school districts, and the key outcome measures were math or reading test scores of elementary school students. Appendix Table B.1 provides information for each study.

All studies (except the Reading Comprehension study) report impact findings using an SP framework (HLM models with baseline covariates), although it cannot be determined which specific estimation and optimization methods were used for the analyses. This chapter discusses findings from a re-analysis of the RCT data using the estimation methods considered above for the FP and SP models. The focus is on models that include baseline covariates. Using study documentation, the choice of baseline covariates (including blocking indicators), the construction of the outcome measures, and the treatment of missing data were made as similar as possible to those used by the authors of the study reports. For comparable models, the impact results reported below are similar to those presented in the published reports.

SAS was used to estimate the GEE, balanced design, REML, and ML models, because research has shown that the statistical packages considered in this paper yield similar estimates for common model specifications and optimization routines (West et al. 2007; Shah 1998). To keep the presentation manageable, the ML and REML estimates were obtained using the Newton-Raphson algorithm. The SAS code that was used to estimate the models is displayed in the footnotes to Table 6.2 below. The permutation tests were conducted using SAS programs written by the author, where the permutation distributions were estimated from 10,000 reallocations of cluster means to the pseudo-treatment and control groups (because the number of possible allocations was too large to enumerate fully for these studies). The ANOVA estimates were also obtained using SAS programs written by the author.7
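To fix ideas, the following SAS sketch shows one common way to specify such models. It is illustrative only, not the study code from the footnotes to Table 6.2, and the dataset and variable names (analysis, posttest, treat, pretest, school) are hypothetical. PROC MIXED fits the random-intercept (HLM) model, using ridge-stabilized Newton-Raphson as its default optimization; PROC GENMOD fits the GEE model with an exchangeable working correlation.

    /* Illustrative only -- not the study code; dataset and variable
       names (analysis, posttest, treat, pretest, school) are
       hypothetical. */

    /* REML estimates of the random-intercept model
       (method=ml gives the ML estimates) */
    proc mixed data=analysis method=reml;
      class school;
      model posttest = treat pretest / solution;  /* treat = 0/1 */
      random intercept / subject=school;          /* cluster effect */
    run;

    /* GEE with an exchangeable working correlation; empirical
       (sandwich) standard errors are reported by default, and the
       modelse option adds the model-based standard errors */
    proc genmod data=analysis;
      class school;
      model posttest = treat pretest;
      repeated subject=school / type=exch modelse;
    run;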

In what follows, information is first presented for each study on cluster-level sample sizes and weights for the FP and SP models. This information is helpful for interpreting the impact findings, which are presented second.

Weights for the Finite-Population and Super-Population Models

As discussed, a key difference between the FP and SP models involves how clusters (schools or classrooms) are weighted in the analysis. In the FP models, clusters are weighted either by their sample sizes or equally, whereas in the SP models, clusters are weighted by the inverses of their variances. The extent to which ATE results differ across the weighting schemes depends on the variability of cluster sample sizes, the ICC values for the outcome variables, and the relationship between cluster-level impacts and cluster sample sizes.
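To make the comparison concrete, the three weighting schemes can be written as follows under a standard random-intercept variance decomposition. The notation here is illustrative and is not reproduced from the report's equations (6) and (12): n_j is the sample size of cluster j, tau^2 and sigma^2 are the between- and within-cluster variance components, and rho = tau^2/(tau^2 + sigma^2) is the ICC.

    \begin{align*}
      \text{FP, weighted by size:} \quad & w_j \propto n_j \\
      \text{FP, weighted equally:} \quad & w_j \propto 1 \\
      \text{SP, inverse variance:} \quad & w_j \propto
        \frac{1}{\tau^2 + \sigma^2/n_j}
        = \frac{n_j}{(\tau^2 + \sigma^2)\left[1 + (n_j - 1)\rho\right]}
    \end{align*}

As rho approaches 0, the SP weights approach the size-weighted FP scheme; as n_j rho grows large, they approach equal weights. This is consistent with the observation below that the variability of the SP weights lies between that of the two FP schemes.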

The top panel of Table 6.1 shows that cluster sample sizes vary within all five studies, but more so for some studies than others. For example, the interquartile range of cluster sizes is about 7 students for the classroom-based Teach for America and Educational Technologies studies, but is 30 students for the school-based Reading Comprehension study. Because cluster sample sizes vary within each study, the cluster-level weights necessarily differ between the two FP models. There are also differences across the studies in ICC values (Table 6.1): the intraclass correlations range from 0.06 to 0.12 for models that include baseline covariates and from 0.13 to 0.29 for models that exclude them.

Finally, the variability of the weights for the SP models lies between the variability of the weights for the two FP models (Table 6.1). For instance, for the Math Curriculum study, the interquartile range for the SP weights for the REML model is 4 (bottom panel of Table 6.1), compared to 14 for the FP model where clusters are weighted by their sample sizes (top panel of Table 6.1), and 0 for the FP model where clusters are weighted equally.

Impact Findings

For all studies, the considered FP and SP estimators yield consistent findings concerning the statistical significance of the ATE estimates (Table 6.2). The estimators show that (1) elementary school students taught by Teach for America teachers performed significantly better on math achievement tests than those taught by traditional teachers, (2) the use of selected software products in the classroom did not improve first graders’ math test scores, (3) the offer of teacher induction programs for beginning teachers did not improve math test scores for second to sixth grade students, (4) the Saxon or Math Expressions math curriculum produced significantly higher fifth grade student math test scores than the other tested math curricula, and (5) the Reading for Knowledge reading curriculum produced significantly lower fifth grade student reading scores than the control (status quo) reading curriculum offered in the study schools.

For each study, the ATE impact estimates vary by no more than 0.02 to 0.03 standard deviations across the eight estimators (Table 6.2). For example, the impact estimates in effect size units range from 0.261 to 0.273 for the Math Curriculum study, from 0.126 to 0.129 for the Teach for America study, and from -0.147 to -0.159 for the Reading Comprehension study.

The estimated standard errors (and p-values), however, vary somewhat more across the eight estimators than the ATE point estimates do (Table 6.2). For example, the standard errors range from 0.038 to 0.075 for the Reading Comprehension study, from 0.035 to 0.050 for the Teacher Induction study, and from 0.478 to 0.766 for the Educational Technologies study. This pattern, in which consistent estimators yield more variable standard error estimates than regression coefficient estimates, has often been found in the literature on observational studies.

Findings for the SP Estimators. On the basis of the empirical findings and the theory from above, the SP estimators can be divided into two main groups. The first group includes the ANOVA and REML estimators, both of which account for the loss in degrees of freedom in the variance estimates due to the estimated regression parameters. Across the five studies, these two estimators yield identical ATE impact estimates, and standard errors that differ by at most 0.003 standard deviations (Table 6.2). The similarity of the ANOVA and REML findings is consistent with Baltagi and Chang (1994), who found using simulations that the ANOVA method performs well for random effects models. Thus, there is reason for education researchers to consider using the ANOVA estimator more often in RCTs.
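For readers who wish to experiment, SAS can produce ANOVA-type (method-of-moments) variance component estimates through the method=type3 option of PROC MIXED. This is offered only as a rough point of comparison: as footnote 7 notes, the author used purpose-written SAS programs, and method=type3 need not coincide exactly with the SA ANOVA method examined here. Dataset and variable names remain hypothetical.

    /* Rough sketch only: method=type3 requests ANOVA-type
       (method-of-moments) variance component estimates, which need
       not match the SA ANOVA method in this paper (see footnote 7). */
    proc mixed data=analysis method=type3;
      class school;
      model posttest = treat pretest / solution;
      random intercept / subject=school;
    run;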

The second group of SP estimators includes the model-based and empirical sandwich GEE estimators and the ML estimator. Across the five studies, these three estimators yield ATE impact estimates that differ from each other by less than 0.002 standard deviations, and standard errors that typically differ from each other by less than 0.005 standard deviations (Table 6.2). The similarity of the estimates from the two GEE estimators suggests that the exchangeable error structure is appropriate for these data. The GEE and ML methods produce smaller standard errors than the REML and ANOVA methods (Table 6.2). This finding is expected for the ML method, which does not adjust for the degrees of freedom lost in estimating the regression parameters.

Finally, the simple balanced design method produces impact and standard error estimates that are consistent with those from the other SP methods, even though this estimator does not account for unbalanced cluster sizes (Table 6.2). Thus, there is good reason to use this simple between-cluster estimator to check the robustness of study findings obtained using the other, more complex methods.
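A minimal sketch of such a between-cluster check follows. It omits the baseline covariates that the study models include (these could be handled by regression-adjusting the cluster means), and the dataset and variable names are hypothetical.

    /* Between-cluster (balanced design) check: collapse to cluster
       means, then compare the treatment and control group means.
       Covariates are omitted here for brevity. */
    proc means data=analysis noprint nway;
      class school treat;
      var posttest;
      output out=cmeans mean=ybar;
    run;

    proc ttest data=cmeans;
      class treat;   /* 0/1 treatment indicator */
      var ybar;      /* cluster mean outcome */
    run;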

Findings for the FP Estimators. Empirical results for the two FP models are displayed in the top panels of Table 6.2 and labeled "Model 1a" and "Model 1b." Because of the different weighting schemes, the ATE impact estimates for these two FP models differ by 0 to 0.035 standard deviations across the studies. The differences are most pronounced for the Educational Technologies and Teacher Induction studies, where the estimated impacts are not statistically significant.

The ATE point estimates for the FP and SP models typically differ by less than 0.005 standard deviations for the three studies with statistically significant impact estimates (the Teach for America, Math Curriculum, and Reading Comprehension studies; Table 6.2). Furthermore, across all five studies, the standard error estimates for the FP models are similar to each other and to those for the empirical sandwich GEE estimator for the SP model (Table 6.2); the pairwise differences in standard errors are all less than 0.007 standard deviations. However, as discussed, the standard error estimates for the FP models are conservative, because they ignore precision gains from the difficult-to-estimate Sτ2 terms in equations (6) and (12).

Finally, for FP Model 1b, the permutation and parametric hypothesis tests yield similar p-values (Table 6.2). For example, the respective p-values are .005 and .008 for the Teach for America study, .766 and .738 for the Educational Technologies study, and .000 and .007 for the Reading Comprehension study. Thus, the normality assumption underlying the parametric tests appears to be supported by the nonparametric methods, which are much more computationally burdensome.
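The author's permutation programs are not reproduced here, but the general approach of reallocating cluster means can be sketched in SAS/IML as follows. The dataset cmeans (one row per cluster, with a cluster mean ybar and a 0/1 indicator treat) is hypothetical.

    /* Illustrative SAS/IML sketch of the permutation test on cluster
       means -- not the author's program. */
    proc iml;
      use cmeans;
      read all var {ybar} into y;    /* cluster mean outcomes */
      read all var {treat} into t;   /* 0/1 treatment indicator */
      close cmeans;

      nT = sum(t);  nC = sum(1 - t);                 /* group counts */
      obs = sum(y # t)/nT - sum(y # (1 - t))/nC;     /* observed diff */

      nrep = 10000;                  /* reallocations, as in the text */
      count = 0;
      call randseed(20090801);
      do r = 1 to nrep;
        tp = t[ranperm(nrow(t))`];   /* reallocate treatment labels */
        diff = sum(y # tp)/nT - sum(y # (1 - tp))/nC;
        if abs(diff) >= abs(obs) then count = count + 1;
      end;

      pvalue = count / nrep;         /* two-sided permutation p-value */
      print obs pvalue;
    quit;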


7 The PROC PANEL procedure in SAS does not implement the SA ANOVA method discussed above, but uses variants of this method (which produce results consistent with those presented in this paper).