Technical Methods
The National Center for Education Evaluation and Regional Assistance (NCEE) conducts unbiased large-scale evaluations of education programs and practices supported by federal funds; provides research-based technical assistance to educators and policymakers; and supports the synthesis and the widespread dissemination of the results of research and evaluation throughout the United States.
In support of this mission, NCEE promotes methodological advancement in the field of education evaluation through investigations involving analyses using existing data sets and explorations of applications of new technical methods, including cost-effectiveness of alternative evaluation strategies. The results of these methodological investigations are published as commissioned, peer-reviewed papers under the series title Technical Methods Reports and posted on the NCEE website at http://ies.ed.gov/ncee/pubs/. These reports are specifically designed for use by researchers, methodologists, and evaluation specialists. The reports address current methodological questions and offer guidance on resolving or advancing the application of high-quality evaluation methods in varying educational contexts.
Completed methods studies, Technical Methods Reports:
- Using State Tests in Education Experiments: A Discussion of the Issues by Henry May, Irma Perez-Johnson, Joshua Haimson, Samina Sattar, and Phil Gleason. Securing data on students' academic achievement is typically one of the most important and costly aspects of conducting education experiments. As state assessment programs have become practically universal and more uniform in terms of grades and subjects tested, the relative appeal of using state tests as a source of study outcome measures has grown. However, the variation in state assessments—in both content and proficiency standards—complicates decisions about whether a particular state test is suitable for research purposes and poses difficulties when planning to combine results across multiple states or grades. This discussion paper aims to help researchers evaluate and make decisions about whether and how to use state test data in education experiments. It outlines the issues that researchers should consider, including how to evaluate the validity and reliability of state tests relative to study purposes; factors influencing the feasibility of collecting state test data; how to analyze state test scores; and whether to combine results based on different tests. It also highlights best practices to help inform ongoing and future experimental studies. Many of the issues discussed are also relevant for non-experimental studies. Publication NCEE 2009-013 (http://ies.ed.gov/ncee/pubs/2009013.asp)
- What to Do When Data Are Missing in Group Randomized Controlled Trials by Michael Puma, Robert B. Olsen, Stephen H. Bell, and Cristofer Price. This NCEE Technical Methods report examines how to address the problem of missing data in the analysis of data in Randomized Controlled Trials (RCTs) of educational interventions, with a particular focus on the common educational situation in which groups of students such as entire classrooms or schools are randomized. Missing outcome data are a problem for two reasons: (1) the loss of sample members can reduce the power to detect statistically significant differences, and (2) the introduction of non-random differences between the treatment and control groups can lead to bias in the estimate of the intervention's effect. The report reviews a selection of methods available for addressing missing data, and then examines their relative performance using extensive simulations that varied a typical educational RCT on three dimensions: (1) the amount of missing data; (2) the level at which data are missing—at the level of whole schools (the assumed unit of randomization) or for students within schools; and, (3) the underlying missing data mechanism. The performance of the different methods is assessed in terms of bias in both the estimated impact and the associated standard error. Publication NCEE 2009-0049 (http://ies.ed.gov/ncee/pubs/20090049.asp)
- Do Typical RCTs of Education Interventions Have Sufficient Statistical Power for Linking Impacts on Teacher Practice and Student Achievement Outcomes? by Peter Schochet. For RCTs of education interventions, it is often of interest to estimate associations between student and mediating teacher practice outcomes, to examine the extent to which the study's conceptual model is supported by the data, and to identify specific mediators that are most associated with student learning. This paper develops statistical power formulas for such exploratory analyses under clustered school-based RCTs using ordinary least squares (OLS) and instrumental variable (IV) estimators, and uses these formulas to conduct a simulated power analysis. The power analysis finds that for currently available mediators, the OLS approach will yield precise estimates of associations between teacher practice measures and student test score gains only if the sample contains about 150 to 200 study schools. The IV approach, which can adjust for potential omitted variable and simultaneity biases, has very little statistical power for mediator analyses. For typical RCT evaluations, these results may have design implications for the scope of the data collection effort for obtaining costly teacher practice mediators. Publication NCEE 2009-4065 (http://ies.ed.gov/ncee/pubs/20094065.asp)
- The Estimation of Average Treatment Effects for Clustered RCTs of Education Interventions, by Peter Schochet. This paper examines the estimation of two-stage clustered RCT designs in education research using the Neyman causal inference framework that underlies experiments. The key distinction between the considered causal models is whether potential treatment and control group outcomes are considered to be fixed for the study population (the finite-population model) or randomly selected from a vaguely-defined universe (the super-population model). Appropriate estimators are derived and discussed for each model. Using data from five large-scale clustered RCTs in the education area, the empirical analysis estimates impacts and their standard errors using the considered estimators. For all studies, the estimators yield identical findings concerning statistical significance. However, standard errors sometimes differ, suggesting that policy conclusions from RCTs could be sensitive to the choice of estimator. Thus, a key recommendation is that analysts test the sensitivity of their impact findings using different estimation methods and cluster-level weighting schemes. Publication NCEE 2009-0061 (http://ies.ed.gov/ncee/pubs/20090061.asp)
- Estimation and Identification of the Complier Average Causal Effect Parameter in Education RCTs, by Peter Schochet and Hanley Chiang. In randomized control trials (RCTs) in the education field, the complier average causal effect (CACE) parameter is often of policy interest, because it pertains to intervention effects for students who receive a meaningful dose of treatment services. This report uses a causal inference and instrumental variables framework to examine the identification and estimation of the CACE parameter for two-level clustered RCTs. The report also provides simple asymptotic variance formulas for CACE impact estimators measured in nominal and standard deviation units. In the empirical work, data from ten large RCTs are used to compare significance findings using correct CACE variance estimators and commonly-used approximations that ignore the estimation error in service receipt rates and outcome standard deviations. Our key finding is that the variance corrections have very little effect on the standard errors of standardized CACE impact estimators. Across the examined outcomes, the correction terms typically raise the standard errors by less than 1 percent, and change p-values at the fourth or higher decimal place. Publication NCEE 2009-4040 (http://ies.ed.gov/ncee/pubs/20094040.asp).
- The Late Pretest Problem in Randomized Control Trials of Education Interventions, by Peter Schochet, addresses pretest-posttest experimental designs that are often used in randomized control trials (RCTs) in the education field to improve the precision of the estimated treatment effects. For logistical reasons, however, pretest data are often collected after random assignment, so that including them in the analysis could bias the posttest impact estimates. Thus, the issue of whether to collect and use late pretest data in RCTs involves a variance-bias tradeoff. This paper addresses this issue both theoretically and empirically for several commonly-used impact estimators using a loss function approach that is grounded in the causal inference literature. The key finding is that for RCTs of interventions that aim to improve student test scores, estimators that include late pretests will typically be preferred to estimators that exclude them or that instead include uncontaminated baseline test score data from other sources. This result holds as long as test score impacts do not grow very quickly early in the school year. Publication NCEE 2009-4033 (http://ies.ed.gov/ncee/pubs/20094033.asp).
- Guidelines for Multiple Testing in Impact Evaluations, by Peter Schochet, presents guidelines for education researchers that address the multiple comparisons problem in impact evaluations in the education area. The problem occurs due to the large number of hypothesis tests that are typically conducted across outcomes and subgroups in evaluation studies, which can lead to spurious significant impact findings. Publication NCEE 2008-4018 (http://ies.ed.gov/ncee/pubs/20084018.asp).
- Statistical Power for Regression Discontinuity Designs in Education Evaluations, by Peter Schochet, examines theoretical and empirical issues related to the statistical power of impact estimates under clustered regression discontinuity (RD) designs. The theory is grounded in the causal inference and HLM modeling literature, and the empirical work focuses on commonly used designs in education research to test intervention effects on student test scores. The main conclusion is that three to four times larger samples are typically required under RD than experimental clustered designs to produce impacts with the same level of statistical precision. Thus, the viability of using RD designs for new impact evaluations of educational interventions may be limited, and will depend on the point of treatment assignment, the availability of pretests, and key research questions. Publication NCEE 2008-4026 (http://ies.ed.gov/ncee/pubs/20084026.asp).
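
As a rough illustration of the multiple comparisons problem described in the Guidelines for Multiple Testing entry above, the sketch below computes how quickly the chance of at least one spurious significant finding grows with the number of hypothesis tests, and applies a Bonferroni adjustment as one standard textbook remedy. It assumes independent tests in which every null hypothesis is true; the function names are hypothetical, and the Bonferroni rule is used here only as a familiar example, not as the procedure the report recommends.

```python
from typing import List


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    a simplification that real outcome and subgroup tests rarely satisfy.
    """
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under the Bonferroni adjustment."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def bonferroni_significant(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag which p-values remain significant after the Bonferroni adjustment."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Illustrative numbers of tests, not taken from any of the reports.
    for m in (1, 5, 20):
        rate = familywise_error_rate(m)
        print(f"{m:>2} tests: chance of a spurious finding = {rate:.3f}")
    # Hypothetical p-values for three outcomes.
    print(bonferroni_significant([0.001, 0.02, 0.04]))
```

Under these assumptions, 20 tests at the 5 percent level carry roughly a 64 percent chance of at least one spurious significant finding, which is why evaluations that test many outcomes and subgroups need an explicit multiple testing strategy.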