Technical Methods Report: Statistical Power for Regression Discontinuity Designs in Education Evaluations

NCEE 2008-4026
August 2008

Chapter 1: Introduction

Regression discontinuity (RD) designs are increasingly being used by researchers to obtain unbiased impact estimates of education-related interventions. These designs are applicable when a continuous "scoring" rule is used to assign the intervention to study units (for example, school districts, schools, or students). Units with scores below a pre-set cutoff value are assigned to the treatment group and units with scores above the cutoff value are assigned to the comparison group, or vice versa. For example, Jacob and Lefgren (2004) examined the effects of attending summer school on the outcomes of Chicago students using the rule that only students with standardized test scores below a cutoff value were required to attend summer school. As another example, the design for the National Evaluation of Early Reading First (ERF) (Jackson et al. 2007) was based on an independent reviewer scoring process where grantees with the highest application scores were awarded ERF grants to improve local preschools. As a final example, Ludwig and Miller (2007) exploited the variation in Head Start funding across counties to examine the program’s effects on schooling and health. Cook (2008), Imbens and Lemieux (2008), and Shadish et al. (2002) provide reviews of the RD design.

Under a well-implemented RD design, the treatment assignment rule is fully observed and can be modeled to yield unbiased impact estimates. A regression line (or curve) is fit in the outcome-score plane for the treatment group and similarly for the comparison group, and the difference between the two fitted lines at the cutoff score (that is, the difference in their intercepts when scores are centered on the cutoff) is the impact estimate. An impact occurs if there is a "discontinuity" between the two regression lines at the cutoff score. Because the selection rule is fully known under the RD design, selection bias issues tend to be less problematic under the RD design than under other non-experimental designs.
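
The estimation logic described above can be sketched in a few lines of Python. This is a minimal illustration rather than the estimation code used in this report: the function names (`fit_line`, `sharp_rd_estimate`), the assumption of a straight line on each side of the cutoff, the convention that units below the cutoff are treated, and the noise-free toy data are all hypothetical choices made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sharp_rd_estimate(scores, outcomes, cutoff):
    """Impact estimate for a sharp RD design (illustrative assumption:
    units with scores below the cutoff receive the treatment).

    Scores are centered on the cutoff so that each fitted intercept is
    the predicted outcome at the cutoff itself.
    """
    centered = [s - cutoff for s in scores]
    treat = [(x, y) for x, y in zip(centered, outcomes) if x < 0]
    comp = [(x, y) for x, y in zip(centered, outcomes) if x >= 0]
    a_treat, _ = fit_line([x for x, _ in treat], [y for _, y in treat])
    a_comp, _ = fit_line([x for x, _ in comp], [y for _, y in comp])
    # The discontinuity at the cutoff is the difference in intercepts.
    return a_treat - a_comp


# Hypothetical noise-free data: common slope of 0.5, and a jump of 2.0
# for units below the cutoff of 0.
scores = [-3, -2, -1, 1, 2, 3]
outcomes = [0.5 * s + (2.0 if s < 0 else 0.0) for s in scores]
estimate = sharp_rd_estimate(scores, outcomes, cutoff=0)
print(estimate)
```

With real data the fitted lines would be estimated with noise, and the choice of functional form (line versus curve) becomes a substantive modeling decision.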

The literature suggests that the RD design might be a suitable alternative to a random assignment (RA) design when an experiment is not feasible (Cook 2008). RD designs tend to interfere less with normal program operations than RA designs, because treatment assignments for the study population are determined by rules developed by program staff or policymakers rather than randomly. Thus, treatments can be targeted to those who normally receive them (for evaluations of existing interventions) or to those who are deemed likely to benefit most from them (for evaluations of new interventions). As a result, RD designs may be easier to "sell" to program staff and participants, which could facilitate efforts to recruit study sites.

A major drawback of the RD design relative to the RA design, however, is that much larger sample sizes are typically required to achieve impact estimates with the same level of statistical power. Goldberger (1972) demonstrated that, for a nonclustered design in which the score variable is normally distributed and centered on the cutoff, the sample under an RD design must be 2.75 times larger than for a corresponding experiment to achieve the same level of statistical precision. Cappelleri et al. (1994) extended this work to allow for a wider range of cutoff values. The reduction in precision in the RD design arises from the substantial correlation, by construction, between the treatment status and score variables that are included in the regression models; this correlation is not present under the RA design.
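
The 2.75 figure can be reproduced with a short calculation. The sketch below rests on an assumption not derived in this chapter: the standard regression result that adding a covariate correlated with treatment status inflates the variance of the impact estimate by a factor of 1/(1 - rho^2), where rho is the correlation between treatment status and the score. It also assumes a standard normal score and a linear score-outcome relationship. The function name `rd_design_effect` is hypothetical.

```python
import math


def rd_design_effect(cutoff_z=0.0):
    """Variance inflation 1 / (1 - rho^2) for a standard normal score
    with the cutoff placed at cutoff_z (in standard deviation units).

    Assumes treatment status T = 1 when the score falls below the cutoff.
    """
    # Share of units below the cutoff (standard normal CDF).
    p = 0.5 * (1.0 + math.erf(cutoff_z / math.sqrt(2.0)))
    # Standard normal density at the cutoff; |Cov(T, score)| equals this.
    phi = math.exp(-cutoff_z ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    # Squared correlation between T and the score (score variance is 1).
    rho_sq = phi ** 2 / (p * (1.0 - p))
    return 1.0 / (1.0 - rho_sq)


# Cutoff at the center of the score distribution.
print(round(rd_design_effect(0.0), 2))  # 2.75
```

At a centered cutoff, rho^2 equals 2/pi (about 0.64), so nearly two-thirds of the variation in treatment status is explained by the score, which is the source of the precision loss.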

This paper extends the work of Goldberger (1972) and Cappelleri et al. (1994) by addressing two main research questions: (1) What is the statistical power of RD designs under clustered (group-based) designs that are typically used in impact evaluations of education interventions, and (2) When are RD designs in a school setting feasible from a cost perspective?

The paper examines commonly-used clustered designs where groups (such as districts, schools, or classrooms) are assigned to a research status. Schochet (2008) and Bloom et al. (2005a) demonstrate that relatively large numbers of schools must be sampled under clustered RA designs (for example, about 60 if pretests are available) to yield impact estimates with adequate levels of precision. Because of additional precision losses under RD designs, statistical power is critical for assessing whether RD designs can be a viable alternative to RA designs in the education field. Although there is a large literature on appropriate methods for analyzing data under RD designs (see, for example, Imbens and Lemieux 2008), much less attention has been paid to examining statistical power under RD designs.

This paper builds on the literature in several other ways. It examines statistical power under RD designs using a framework that is anchored in the causal inference and hierarchical linear modeling (HLM) literature. The paper also examines statistical power for a wider range of score distributions than has been explored previously, and for both sharp RD designs (where all units comply with their treatment assignments) and fuzzy RD designs (which allow for noncompliers). In addition, the paper discusses the power implications of including additional baseline covariates in the regression models, and criteria for determining the appropriate range of scores for the study sample. Finally, the paper uses the theoretical formulas and empirically based parameter assumptions to calculate appropriate sample sizes for alternative RD designs. These estimates can serve as a guide for future RD designs in the education field.

The empirical analysis focuses on achievement test scores of elementary school and preschool students in low-performing school districts. The focus is on test scores due to the accountability provisions of the No Child Left Behind Act of 2001, and the ensuing federal emphasis on testing interventions to improve reading and mathematics scores of young students.

The rest of this paper is organized into seven chapters. Chapter 2 discusses how to measure statistical power, and Chapter 3 discusses the clustered designs that are considered. In Chapter 4, assuming that student-level data are aggregated to the group level, I discuss the theory underlying the RD and RA designs, variance calculations, and RD design effects. In Chapter 5, the analysis is extended to multilevel models where the data are analyzed at the student level, and in Chapter 6, I briefly discuss the appropriate range of scores for the study sample. Chapter 7 discusses empirical results and Chapter 8 presents conclusions.
