Technical Methods Report: The Estimation of Average Treatment Effects for Clustered RCTs of Education Interventions

NCEE 2009-0061
August 2009

Chapter 7: Summary and Conclusions

This paper has examined the estimation of average treatment effects (ATEs) for two-stage clustered RCT designs in education research using the Neyman causal inference framework that underlies experiments. The key distinction between the causal models considered is whether potential treatment and control group outcomes are regarded as fixed for the study population (the FP model) or as randomly drawn from a vaguely defined super-population (the SP model).
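For reference, the two estimands can be written in standard potential-outcomes notation (the notation below is illustrative rather than taken verbatim from the paper): for student i in cluster j, let Y_ij(1) and Y_ij(0) denote the potential outcomes under treatment and control, with m clusters of sizes n_j in the sample.

    % FP estimand: the ATE for the N students in the study sample,
    % whose potential outcomes are treated as fixed:
    \beta_{FP} = \frac{1}{N} \sum_{j=1}^{m} \sum_{i=1}^{n_j}
                 \bigl[ Y_{ij}(1) - Y_{ij}(0) \bigr],
    \qquad N = \sum_{j=1}^{m} n_j .

    % SP estimand: the effect for the average cluster, where cluster-level
    % mean potential outcomes are random draws from a super-population:
    \beta_{SP} = E\bigl[ \bar{Y}_{j}(1) - \bar{Y}_{j}(0) \bigr] .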

In the FP model, the only source of randomness is treatment status, and a clustered design arises only because students in the same cluster share the same treatment status. The relevant impact parameter for this model is the average treatment effect for those in the study sample; thus, the impact results are only internally valid. The asymptotic variance for the FP model (derived in this paper) can be estimated using a GEE estimator with an independent working correlation structure. Two weighting options for this model are (1) to weight each student equally (the OLS approach) or (2) to weight each cluster equally (to estimate ATEs for the average cluster in the sample). The FP variance estimators are likely to be conservative, however, because they ignore precision gains from difficult-to-estimate variance terms that capture the extent to which treatment effects vary and co-vary across students in the same cluster. Thus, in theory, the FP estimators could yield more precise ATE estimates than the SP estimators, but these precision gains are difficult to realize in practice.
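For concreteness, the two weighting options can be sketched with statsmodels; the data frame and its column names (y, treat, cluster) are hypothetical, and the sketch assumes treatment status is constant within clusters. Fitting OLS with a cluster-robust (sandwich) variance matches a GEE fit with an independence working correlation, up to finite-sample corrections.

    # Sketch of the two FP weighting options, assuming a pandas DataFrame
    # `df` with columns y (outcome), treat (0/1, constant within cluster),
    # and cluster (cluster id). Column names are illustrative.
    import statsmodels.formula.api as smf

    def fp_estimates(df):
        # Option 1: weight each student equally (OLS), with a cluster-robust
        # sandwich variance -- equivalent (up to finite-sample corrections)
        # to GEE with an independence working correlation structure.
        ols = smf.ols("y ~ treat", data=df).fit(
            cov_type="cluster", cov_kwds={"groups": df["cluster"]}
        )

        # Option 2: weight each cluster equally by giving each student a
        # weight of 1/n_j, where n_j is the size of the student's cluster.
        w = 1.0 / df.groupby("cluster")["y"].transform("size")
        wls = smf.wls("y ~ treat", data=df, weights=w).fit(
            cov_type="cluster", cov_kwds={"groups": df["cluster"]}
        )

        for name, res in [("student-weighted", ols), ("cluster-weighted", wls)]:
            print(f"{name}: ATE = {res.params['treat']:.3f} "
                  f"(SE = {res.bse['treat']:.3f})")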

In the SP model, cluster- and student-level potential outcomes are considered to be randomly sampled from their respective super-population distributions. In this framework, the relevant ATE parameter is the intervention effect for the average cluster in the super-population. Thus, impact findings are assumed to generalize beyond the study sample, although it is often difficult to precisely define the study universe. For estimating the SP model, the paper discussed key features of several feasible GLS estimators (ML, REML, ANOVA, and GEE) under an exchangeable random effects error structure. Under these estimators, clusters are weighted by the inverses of their variances, and the variability of these weights lies between that of the weights under the two FP weighting schemes.
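A companion sketch for the SP side, under the same hypothetical data layout, fits the exchangeable random-intercept model by REML and ML and adds a GEE fit with an exchangeable working correlation as a moment-based alternative:

    # Sketch of SP estimation under an exchangeable (random-intercept)
    # error structure, again assuming a DataFrame df with illustrative
    # columns y, treat, and cluster.
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    def sp_estimates(df):
        # REML (the statsmodels default) and ML fits of a random-intercept
        # model; clusters are implicitly weighted by inverse variances.
        reml = smf.mixedlm("y ~ treat", data=df, groups=df["cluster"]).fit()
        ml = smf.mixedlm("y ~ treat", data=df,
                         groups=df["cluster"]).fit(reml=False)

        # GEE with an exchangeable working correlation: a moment-based
        # alternative to the likelihood-based estimators above.
        gee = sm.GEE.from_formula(
            "y ~ treat", groups="cluster", data=df,
            cov_struct=sm.cov_struct.Exchangeable(),
        ).fit()

        for name, res in [("REML", reml), ("ML", ml), ("GEE", gee)]:
            print(f"{name}: ATE = {res.params['treat']:.3f} "
                  f"(SE = {res.bse['treat']:.3f})")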

Using data from five recent large-scale clustered RCTs in education, the empirical analysis estimated ATEs and their standard errors using each of the estimators considered. For all five studies, the estimators yield consistent findings concerning statistical significance. However, although the estimated impacts are similar across estimators, the standard errors (and hence the p-values) differ more noticeably, suggesting that in particular studies, policy conclusions could differ depending on the estimator used.

The choice of the primary estimation method and cluster-level weighting scheme should be the one that best fits the evaluation's research questions and objectives, and should be specified and justified in the analysis protocols. However, there might not always be a scientific basis for making these benchmark choices (that is, there might not be a "true" underlying statistical model for the study). Thus, a key recommendation of this paper is that education researchers test the sensitivity of their benchmark impact findings using alternative estimation methods, rather than relying solely on the methods with which they are most comfortable. These sensitivity analyses can help rule out the possibility that the impact findings are driven by specific distributional assumptions about the data or by asymptotic approximations. Furthermore, it is recommended that findings from sensitivity analyses be reported in study appendixes, that discrepancies between sensitivity and benchmark findings be explained where possible, and that the robustness of the results be reflected in the study conclusions.
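As an illustration of what such a sensitivity check might look like in practice (again with hypothetical column names, and with only a subset of the estimators discussed above), one could tabulate the ATE, standard error, and p-value from a benchmark fit alongside several alternatives:

    # Sketch of a sensitivity table: fit the benchmark and alternative
    # estimators on the same data and compare estimates, SEs, and p-values.
    # Assumes a DataFrame df with illustrative columns y, treat, cluster.
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    def sensitivity_table(df):
        fits = {
            "Random intercept, REML (SP)": smf.mixedlm(
                "y ~ treat", data=df, groups=df["cluster"]).fit(),
            "GEE, exchangeable (SP)": sm.GEE.from_formula(
                "y ~ treat", groups="cluster", data=df,
                cov_struct=sm.cov_struct.Exchangeable()).fit(),
            "OLS, cluster-robust (FP)": smf.ols("y ~ treat", data=df).fit(
                cov_type="cluster", cov_kwds={"groups": df["cluster"]}),
        }
        print(f"{'estimator':<30}{'ATE':>8}{'SE':>8}{'p':>8}")
        for name, res in fits.items():
            print(f"{name:<30}{res.params['treat']:>8.3f}"
                  f"{res.bse['treat']:>8.3f}{res.pvalues['treat']:>8.3f}")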

Researchers currently most often report impact findings using the SP framework based on REML or ML methods. The results in this paper suggest that, as a sensitivity analysis, impacts could also be estimated using other methods, such as the balanced-design, GEE, and FP estimators. The ANOVA method is another approach that could be used more often in education research.

Finally, the choice between the FP and SP frameworks is a difficult philosophical issue. In practice, however, the two approaches will tend to blur, because standard estimation procedures do not account for the precision gains available under the FP model, and the empirical results presented in this paper suggest that the FP and SP models yield similar impact findings. Furthermore, the two approaches blur under balanced designs. Nonetheless, researchers should understand the assumptions underlying the SP and FP approaches and their implications for generalizing and interpreting the impact findings.
