Technical Methods Report: Statistical Power for Regression Discontinuity Designs in Education Evaluations

NCEE 2008-4026
August 2008

Chapter 8: Summary and Conclusions

This paper has examined theoretical and empirical issues related to the statistical power of impact estimates under clustered RD designs that could be conducted in a school setting. The theoretical framework is grounded in the causal inference and hierarchical linear modeling (HLM) literature, and the empirical work focused on group-based designs that are commonly used to test the effects of education interventions on students' standardized test scores.

The main conclusion is that much larger samples are required under RD than RA designs to produce rigorous impact estimates. This occurs because the large RD design effects that have been found previously for nonclustered designs carry over to most multilevel clustered designs that are typically used in education research. This pattern holds for a wide range of score distributions and score cutoff values.
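The size of the design effect can be illustrated with a short calculation. The sketch below assumes a linear RD model that regresses the outcome on the treatment indicator T and the assignment score S, so that the variance of the impact estimate is inflated by the standard variance inflation factor 1 / (1 - rho^2), where rho is the correlation between T and S. The function names are hypothetical, and the two score distributions (standard normal and uniform) are illustrative choices rather than an exhaustive set.

```python
from statistics import NormalDist


def design_effect(rho_ts: float) -> float:
    """Variance inflation from the correlation between treatment status and the score."""
    return 1.0 / (1.0 - rho_ts ** 2)


def rho_normal(treated_share: float) -> float:
    """Correlation between T = 1[S >= c] and a standard normal score S.

    The cutoff c is chosen so that `treated_share` of units fall at or above it.
    Cov(T, S) equals the normal density at c, and Var(T) = p(1 - p).
    """
    std = NormalDist()
    cutoff = std.inv_cdf(1.0 - treated_share)
    return std.pdf(cutoff) / (treated_share * (1.0 - treated_share)) ** 0.5


def rho_uniform(treated_share: float) -> float:
    """Correlation between T = 1[S >= c] and a uniform score S.

    For S uniform on [0, 1], the squared correlation reduces to 3p(1 - p).
    """
    return (3.0 * treated_share * (1.0 - treated_share)) ** 0.5


# Cutoff at the median of the score distribution (half of units treated).
de_normal = design_effect(rho_normal(0.5))    # about 2.75
de_uniform = design_effect(rho_uniform(0.5))  # 4.0

print(f"normal scores:  design effect = {de_normal:.2f}")
print(f"uniform scores: design effect = {de_uniform:.2f}")
```

Under these assumptions, an RD design with a median cutoff needs roughly 2.75 to 4 times as many units as an RA design to reach the same variance, before any clustering or covariate adjustments are taken into account.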

These findings have important implications for the viability of RD designs in new education evaluations, because recruiting study schools, implementing interventions, and collecting data are costly. Given the resources that the U.S. Department of Education and other funders typically devote to large-scale impact studies, the results suggest that RD designs in which schools are assigned to treatment or control status are likely to be feasible only for interventions that can have relatively large effects of 0.33 standard deviations or more. RD designs appear to be more viable for less-clustered designs in which classrooms or students are assigned directly to a research condition.

A key finding is that clustered RD designs can yield impact findings with sufficient levels of precision only if detailed baseline data—and in particular, pre-intervention measures of the outcomes—are collected and used in the regression models to increase R² values. Furthermore, RD designs will typically have sufficient power for detecting impacts at the pooled level only, but not for population subgroups; this problem is more severe for RD than RA designs.
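The role of baseline covariates can be sketched with a minimum detectable effect (MDE) calculation. The example below uses a commonly cited two-level variance expression for designs in which schools are the unit of assignment; it is not necessarily the exact formula used in this report. Every input is a hypothetical value chosen for illustration: 40 schools, 60 students per school, an intraclass correlation of 0.15, an RD design effect of 2.75, and R² values of 0.5 at both levels when a pretest is included.

```python
from statistics import NormalDist


def mde(n_schools, students_per_school, icc, r2_between, r2_within,
        design_effect=1.0, treated_share=0.5, alpha=0.05, power=0.80):
    """Approximate MDE in standard deviation units when schools are assigned.

    Uses a normal approximation for the multiplier (about 2.8 with the
    defaults) and ignores degrees-of-freedom corrections.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    denom = treated_share * (1 - treated_share) * n_schools
    # Between-school and within-school variance components, each reduced
    # by the share of variance that the covariates explain at that level.
    between = icc * (1 - r2_between) / denom
    within = (1 - icc) * (1 - r2_within) / (denom * students_per_school)
    return multiplier * (design_effect * (between + within)) ** 0.5


# Hypothetical inputs, for illustration only.
ASSUMED = dict(n_schools=40, students_per_school=60, icc=0.15)
RD_DESIGN_EFFECT = 2.75  # assumed value for a median cutoff on a normal score

mde_ra_no_cov = mde(**ASSUMED, r2_between=0.0, r2_within=0.0)
mde_rd_no_cov = mde(**ASSUMED, r2_between=0.0, r2_within=0.0,
                    design_effect=RD_DESIGN_EFFECT)
mde_rd_pretest = mde(**ASSUMED, r2_between=0.5, r2_within=0.5,
                     design_effect=RD_DESIGN_EFFECT)

print(f"RA, no covariates: {mde_ra_no_cov:.2f}")
print(f"RD, no covariates: {mde_rd_no_cov:.2f}")
print(f"RD, with pretest:  {mde_rd_pretest:.2f}")
```

In this expression the MDE grows with the square root of the design effect and shrinks with the square root of the unexplained variance, so under these assumptions a pretest that explains half of the variance at both levels offsets a twofold increase in the design effect.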

In conclusion, although well-designed RD studies can yield unbiased impact estimates, they cannot necessarily be viewed as a substitute for experimental designs in the education field. School sample sizes typically need to be about three to four times larger under RD than RA designs to achieve impact estimates with the same levels of precision. Furthermore, RD impact findings typically pertain to a narrower population (units with scores near the cutoff) than findings from experiments (units across the full range of scores), and they rely on the validity of critical modeling assumptions that are not required under the RA design. The desirability of using RD designs will depend on the point of treatment assignment, the availability of pretest data, and the key research questions.
