
National Center for Education Evaluation and Regional Assistance

Technical Methods

The National Center for Education Evaluation and Regional Assistance (NCEE) conducts unbiased large-scale evaluations of education programs and practices supported by federal funds; provides research-based technical assistance to educators and policymakers; and supports the synthesis and the widespread dissemination of the results of research and evaluation throughout the United States.

In support of this mission, NCEE promotes methodological advancement in the field of education evaluation through investigations that analyze existing data sets and explore applications of new technical methods, including the cost-effectiveness of alternative evaluation strategies. The results of these methodological investigations are published as commissioned, peer-reviewed papers under the series title Technical Methods Reports. These reports are specifically designed for use by researchers, methodologists, and evaluation specialists. The reports address current methodological questions and offer guidance on resolving methodological issues and advancing the application of high-quality evaluation methods in varying educational contexts.

In addition to the current Technical Methods Reports, the series has been expanded to include a new type of report that provides researchers and evaluators with examples of practical applications of the methods work. The NCEE Reference Report series is designed to advance the practice of rigorous education research by making focused resources available to education researchers and users of education research, both to facilitate the design of future studies and to help users of completed studies better understand their strengths and limitations.

Subjects selected for NCEE Reference Reports examine and review rigorous evaluation studies conducted under NCEE to extract examples of good or promising evaluation practices. The reports present study information to demonstrate the range of "solutions" developed so far. In this way, NCEE Reference Reports aim to promote cost-effective study designs by identifying examples of the use of similar and/or reliable methods, measures, or analyses across evaluations. It is important to note that NCEE Reference Reports are not meant to resolve common methodological issues in conducting education evaluation. Rather, they present information about how current evaluations under NCEE have approached an issue or selected measurement and analysis strategies. These compilations serve as cross-walks, making information buried in study reports more accessible for immediate use by researchers and evaluators.

Completed methods studies (Technical Methods Reports):

  • Replicating Experimental Impact Estimates Using a Regression Discontinuity Approach by Philip M. Gleason, Alexandra M. Resch, and Jillian A. Berk. This NCEE Technical Methods Paper compares the estimated impacts of an educational intervention using experimental and regression discontinuity (RD) study designs. The analysis used data from two large-scale randomized controlled trials—the Education Technology Evaluation and the Teach for America Study—to provide evidence on the performance of RD estimators in two specific contexts. More generally, the report presents and implements a method for examining the performance of RD estimators that could be used in other contexts. The study found that the RD and experimental designs produced impact estimates that differed by amounts meaningful in size, though the differences were not statistically significant. The study also found that manipulation of the assignment variable in RD designs can substantially influence RD impact estimates, particularly if the manipulation is related to the outcome and occurs close to the assignment variable's cutoff value. Publication NCEE 2012-4025
  • Whether and How to Use State Tests to Measure Student Achievement in a Multi-State Randomized Experiment: An Empirical Assessment Based on Four Recent Evaluations by Marie-Andrée Somers, Pei Zhu, and Edmond Wong. An important question for educational evaluators is how best to measure academic achievement, the outcome of primary interest in many studies. In large-scale evaluations, student achievement has typically been measured by administering a common standardized test to all students in the study (a "study-administered test"). In the era of No Child Left Behind (NCLB), however, state assessments have become an increasingly viable source of information on student achievement. Using state test scores can yield substantial cost savings for the study and can eliminate the burden of additional testing on students and teaching staff. On the other hand, state tests can also pose certain difficulties: their content may not be well aligned with the outcomes targeted by the intervention, and variation in the content and scale of the tests can complicate pooling scores across states and grades. Publication NCEE 2012-4015
  • Estimating the Impacts of Educational Interventions Using State Tests or Study-Administered Tests by Robert B. Olsen, Fatih Unlu, Cristofer Price, and Andrew P. Jaciw. State assessments provide a relatively inexpensive and increasingly accessible source of data on student achievement. In the past, rigorous evaluations of educational interventions typically administered standardized tests selected by the researchers ("study-administered tests") to measure student achievement outcomes. Increasingly, researchers are turning to the lower cost option of using state assessments for measures of student achievement. Publication NCEE 2012-4016
  • Variability in Pretest-Posttest Correlation Coefficients by Student Achievement Level by Russell Cole, Joshua Haimson, Irma Perez-Johnson, and Henry May. State assessments are increasingly used as outcome measures for education evaluations and pretest scores are generally used as control variables in these evaluations. The correlation between the pretest and outcome (posttest) measures is a factor in determining, among other things, the statistical power of a study. This report examines the variability in pretest-posttest correlation coefficients for state assessment data on samples of low-performing, average-performing, and proficient students to determine how sample characteristics (e.g., achievement level) affect pretest-posttest correlation coefficients. As an application, this report illustrates how statistical power is affected by variations in pretest-posttest correlation coefficients across groups with different sample characteristics. Achievement data from four states and two large districts are examined. The results confirm that pretest-posttest correlation coefficients are smaller for samples of low performers than for samples representing the full range of performers, thus, resulting in lower statistical power for impact studies than would be the case if the study sample included a more representative group of students. Publication NCEE 2011-4033
  • Using an Experimental Evaluation of Charter Schools to Test Whether Nonexperimental Comparison Group Methods Can Replicate Experimental Impact Estimates by Philip Gleason, Melissa Clark, Christina Clark Tuttle, and Emily Dwoyer. This NCEE Technical Methods Paper compares the estimated impacts of the offer of charter school enrollment using an experimental design and a non-experimental comparison group design. The study examined four different approaches to creating non-experimental comparison groups: ordinary least squares regression modeling, exact matching, propensity score matching, and fixed effects modeling. The data for the study are from students in the districts and grades that were represented in an experimental design evaluation of charter schools conducted by the U.S. Department of Education in 2010. Publication NCEE 2010-4019
  • Precision Gains from Publicly Available School Proficiency Measures Compared to Study-Collected Test Scores in Education Cluster-Randomized Trials by John Deke, Lisa Dragoset, and Ravaris Moore. In randomized controlled trials (RCTs) where the outcome is a student-level, study-collected test score, a particularly valuable piece of information is a study-collected baseline score from the same or similar test (a pre-test). Pre-test scores can be used to increase the precision of impact estimates, conduct subgroup analysis, and reduce bias from missing data at follow-up. Although administering baseline tests provides analytic benefits, there may be less expensive ways to achieve some of the same benefits, such as using publicly available school-level proficiency data. This paper compares the precision gains from adjusting impact estimates for student-level pre-test scores (which can be costly to collect) with the gains associated with using publicly available school-level proficiency data (available at low cost), using data from five large-scale RCTs conducted for the Institute of Education Sciences. The study finds that, on average, adjusting for school-level proficiency does not increase statistical precision as much as adjusting for student-level baseline test scores does. Across the cases examined, the number of schools included in studies would have to nearly double to compensate for the loss in precision from using school-level proficiency data instead of student-level baseline test data. Publication NCEE 2010-4003
  • Error Rates for Measuring Teacher and School Performance Using Value-Added Models by Peter Z. Schochet and Hanley S. Chiang. This paper addresses likely error rates for measuring teacher and school performance in the upper elementary grades using value-added models applied to student test score gain data. Using realistic performance measurement system schemes based on hypothesis testing, we develop error rate formulas based on OLS and Empirical Bayes estimators. Simulation results suggest that value-added estimates are likely to be noisy using the amount of data that are typically used in practice. Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data. Corresponding error rates for overall false positive and negative errors are 10 and 20 percent, respectively. Lower error rates can be achieved if schools are the performance unit. The results suggest that policymakers must carefully consider likely system error rates when using value-added estimates to make high-stakes decisions regarding educators. Publication NCEE 2010-4004
  • Survey of Outcomes Measurement in Research on Character Education Programs by Ann E. Person, Emily Moiduddin, Megan Hague-Angus, and Lizabeth M. Malone. Character education programs are school-based programs that have as one of their objectives promoting the character development of students. This report systematically examines the outcomes that were measured in evaluations of a delimited set of character education programs and the research tools used for measuring the targeted outcomes. The multi-faceted nature of character development and many possible ways of conceptualizing it, the large and growing number of school-based programs to promote character development, and the relative newness of efforts to evaluate character education programs using rigorous research methods all combine to make the selection or development of measures relevant to the evaluation of these programs especially challenging. This report is a step toward creating a resource that can inform measure selection for conducting rigorous, cost-effective studies of character education programs. The report, however, does not provide comprehensive information on all measures or types of measures, guidance on specific measures, or recommendations on specific measures. Publication NCEE 2009-006
  • Using State Tests in Education Experiments: A Discussion of the Issues by Henry May, Irma Perez-Johnson, Joshua Haimson, Samina Sattar, and Phil Gleason. Securing data on students' academic achievement is typically one of the most important and costly aspects of conducting education experiments. As state assessment programs have become practically universal and more uniform in terms of grades and subjects tested, the relative appeal of using state tests as a source of study outcome measures has grown. However, the variation in state assessments—in both content and proficiency standards—complicates decisions about whether a particular state test is suitable for research purposes and poses difficulties when planning to combine results across multiple states or grades. This discussion paper aims to help researchers evaluate and make decisions about whether and how to use state test data in education experiments. It outlines the issues that researchers should consider, including how to evaluate the validity and reliability of state tests relative to study purposes; factors influencing the feasibility of collecting state test data; how to analyze state test scores; and whether to combine results based on different tests. It also highlights best practices to help inform ongoing and future experimental studies. Many of the issues discussed are also relevant for non-experimental studies. Publication NCEE 2009-013
  • What to Do When Data Are Missing in Group Randomized Controlled Trials by Michael Puma, Robert B. Olsen, Stephen H. Bell, and Cristofer Price. This NCEE Technical Methods report examines how to address the problem of missing data in the analysis of data in Randomized Controlled Trials (RCTs) of educational interventions, with a particular focus on the common educational situation in which groups of students such as entire classrooms or schools are randomized. Missing outcome data are a problem for two reasons: (1) the loss of sample members can reduce the power to detect statistically significant differences, and (2) the introduction of non-random differences between the treatment and control groups can lead to bias in the estimate of the intervention's effect. The report reviews a selection of methods available for addressing missing data, and then examines their relative performance using extensive simulations that varied a typical educational RCT on three dimensions: (1) the amount of missing data; (2) the level at which data are missing—at the level of whole schools (the assumed unit of randomization) or for students within schools; and, (3) the underlying missing data mechanism. The performance of the different methods is assessed in terms of bias in both the estimated impact and the associated standard error. Publication NCEE 2009-0049
  • Do Typical RCTs of Education Interventions Have Sufficient Statistical Power for Linking Impacts on Teacher Practice and Student Achievement Outcomes? by Peter Schochet. For RCTs of education interventions, it is often of interest to estimate associations between student and mediating teacher practice outcomes, to examine the extent to which the study's conceptual model is supported by the data, and to identify specific mediators that are most associated with student learning. This paper develops statistical power formulas for such exploratory analyses under clustered school-based RCTs using ordinary least squares (OLS) and instrumental variable (IV) estimators, and uses these formulas to conduct a simulated power analysis. The power analysis finds that for currently available mediators, the OLS approach will yield precise estimates of associations between teacher practice measures and student test score gains only if the sample contains about 150 to 200 study schools. The IV approach, which can adjust for potential omitted variable and simultaneity biases, has very little statistical power for mediator analyses. For typical RCT evaluations, these results may have design implications for the scope of the data collection effort for obtaining costly teacher practice mediators. Publication NCEE 2009-4065
  • The Estimation of Average Treatment Effects for Clustered RCTs of Education Interventions, by Peter Schochet. This paper examines the estimation of two-stage clustered RCT designs in education research using the Neyman causal inference framework that underlies experiments. The key distinction between the considered causal models is whether potential treatment and control group outcomes are considered to be fixed for the study population (the finite-population model) or randomly selected from a vaguely-defined universe (the super-population model). Appropriate estimators are derived and discussed for each model. Using data from five large-scale clustered RCTs in the education area, the empirical analysis estimates impacts and their standard errors using the considered estimators. For all studies, the estimators yield identical findings concerning statistical significance. However, standard errors sometimes differ, suggesting that policy conclusions from RCTs could be sensitive to the choice of estimator. Thus, a key recommendation is that analysts test the sensitivity of their impact findings using different estimation methods and cluster-level weighting schemes. Publication NCEE 2009-0061
  • Estimation and Identification of the Complier Average Causal Effect Parameter in Education RCTs, by Peter Schochet and Hanley Chiang. In randomized control trials (RCTs) in the education field, the complier average causal effect (CACE) parameter is often of policy interest, because it pertains to intervention effects for students who receive a meaningful dose of treatment services. This report uses a causal inference and instrumental variables framework to examine the identification and estimation of the CACE parameter for two-level clustered RCTs. The report also provides simple asymptotic variance formulas for CACE impact estimators measured in nominal and standard deviation units. In the empirical work, data from ten large RCTs are used to compare significance findings using correct CACE variance estimators and commonly-used approximations that ignore the estimation error in service receipt rates and outcome standard deviations. Our key finding is that the variance corrections have very little effect on the standard errors of standardized CACE impact estimators. Across the examined outcomes, the correction terms typically raise the standard errors by less than 1 percent, and change p-values at the fourth or higher decimal place. Publication NCEE 2009-4040
  • The Late Pretest Problem in Randomized Control Trials of Education Interventions, by Peter Schochet, addresses pretest-posttest experimental designs that are often used in randomized control trials (RCTs) in the education field to improve the precision of the estimated treatment effects. For logistical reasons, however, pretest data are often collected after random assignment, so that including them in the analysis could bias the posttest impact estimates. Thus, the issue of whether to collect and use late pretest data in RCTs involves a variance-bias tradeoff. This paper addresses this issue both theoretically and empirically for several commonly-used impact estimators using a loss function approach that is grounded in the causal inference literature. The key finding is that for RCTs of interventions that aim to improve student test scores, estimators that include late pretests will typically be preferred to estimators that exclude them or that instead include uncontaminated baseline test score data from other sources. This result holds as long as test score impacts do not grow very quickly early in the school year. Publication NCEE 2009-4033
  • Guidelines for Multiple Testing in Impact Evaluations, by Peter Schochet, presents guidelines for education researchers that address the multiple comparisons problem in impact evaluations in the education area. The problem occurs due to the large number of hypothesis tests that are typically conducted across outcomes and subgroups in evaluation studies, which can lead to spurious significant impact findings. Publication NCEE 2008-4018
  • Statistical Power for Regression Discontinuity Designs in Education Evaluations, by Peter Schochet, examines theoretical and empirical issues related to the statistical power of impact estimates under clustered regression discontinuity (RD) designs. The theory is grounded in the causal inference and HLM modeling literature, and the empirical work focuses on commonly used designs in education research to test intervention effects on student test scores. The main conclusion is that three to four times larger samples are typically required under RD than experimental clustered designs to produce impacts with the same level of statistical precision. Thus, the viability of using RD designs for new impact evaluations of educational interventions may be limited, and will depend on the point of treatment assignment, the availability of pretests, and key research questions. Publication NCEE 2008-4026
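The pretest-posttest correlation finding described in the Cole, Haimson, Perez-Johnson, and May report (NCEE 2011-4033) can be sketched numerically. The snippet below is a minimal illustration, assuming a simple student-level randomized design in which covariate adjustment shrinks the standard error of the impact estimate by sqrt(1 - rho^2); the correlation values for the full-range and low-performer samples are hypothetical, not figures from the report.

```python
import math

def mdes_multiplier(rho: float) -> float:
    """Factor sqrt(1 - rho^2) by which pretest adjustment shrinks the
    standard error of the impact estimate when the pretest-posttest
    correlation is rho (simple student-level randomized design)."""
    return math.sqrt(1.0 - rho ** 2)

full_range = mdes_multiplier(0.8)  # full-range sample (hypothetical rho)
low_perf = mdes_multiplier(0.6)    # low-performer sample (hypothetical rho)

# Required sample size scales with variance, i.e. with the squared ratio:
# here roughly 1.8x as many students for the same power.
inflation = (low_perf / full_range) ** 2
```

This is why samples restricted to low performers, which tend to have smaller pretest-posttest correlations, yield lower statistical power at a given sample size.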
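The precision comparison in the Deke, Dragoset, and Moore report (NCEE 2010-4003) can be illustrated with the standard two-level variance formula for cluster RCTs, in which a covariate reduces the between-school and within-school variance components by its R-squared at each level. All numbers below (ICC, R-squared values, students per school) are hypothetical choices for illustration, not figures from the report.

```python
def var_factor(icc, r2_between, r2_within, n_per_school):
    """Variance of a cluster-RCT impact estimate, up to a constant:
    between-school component plus within-school component, each reduced
    by the covariate's R-squared at that level."""
    return icc * (1 - r2_between) + (1 - icc) * (1 - r2_within) / n_per_school

# A student-level pre-test (hypothetically) explains variance at both
# levels; school-level proficiency explains between-school variance only.
student_cov = var_factor(icc=0.05, r2_between=0.8, r2_within=0.6, n_per_school=30)
school_cov = var_factor(icc=0.05, r2_between=0.8, r2_within=0.0, n_per_school=30)

# Variance is inversely proportional to the number of schools, so this
# ratio is how many times more schools the school-covariate design needs.
school_inflation = school_cov / student_cov
```

Under these hypothetical inputs the school-covariate design needs roughly 1.8 times as many schools, echoing the report's near-doubling finding in spirit.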
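The error-rate findings in the Schochet and Chiang value-added report (NCEE 2010-4004) rest on the noisiness of value-added estimates. The toy simulation below is not the report's error-rate formulas; it simply shows how often a noisy estimate puts a teacher on the wrong side of average as reliability (the signal share of estimate variance) changes. The reliability values are hypothetical stand-ins for one versus three years of data.

```python
import random

def sign_error_rate(n=20000, reliability=0.4, seed=1):
    """Toy simulation: true teacher effects are standard normal, estimates
    add noise so that `reliability` is the signal share of estimate
    variance. Returns the share of teachers whose estimate falls on the
    wrong side of the average."""
    rng = random.Random(seed)
    noise_sd = ((1 - reliability) / reliability) ** 0.5  # signal SD is 1
    wrong = 0
    for _ in range(n):
        true_effect = rng.gauss(0, 1)
        estimate = true_effect + rng.gauss(0, noise_sd)
        if (true_effect > 0) != (estimate > 0):
            wrong += 1
    return wrong / n

one_year = sign_error_rate(reliability=0.3)     # noisier: less data
three_year = sign_error_rate(reliability=0.55)  # steadier: more data
```

More data (higher reliability) lowers the misclassification rate, but it remains far from zero, which is the report's central caution for high-stakes use.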
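The bias mechanism motivating the Puma, Olsen, Bell, and Price missing-data report (NCEE 2009-0049) can be seen in a toy simulation. This is a minimal sketch of the motivation, not of the report's methods; the effect size and the missingness probabilities are hypothetical.

```python
import random

def estimated_effect(n=50000, effect=0.2, outcome_related_missing=True, seed=2):
    """Difference-in-means impact estimate in a toy RCT where low-scoring
    control-group students may be less likely to be observed at follow-up."""
    rng = random.Random(seed)
    treatment = [rng.gauss(0, 1) + effect for _ in range(n)]
    control = []
    for _ in range(n):
        y = rng.gauss(0, 1)
        # Response probability drops for low scorers when missingness is
        # related to the outcome (hypothetical 0.95 vs 0.60 response rates).
        p_obs = 0.60 if (outcome_related_missing and y <= 0) else 0.95
        if rng.random() < p_obs:
            control.append(y)
    return sum(treatment) / len(treatment) - sum(control) / len(control)

biased = estimated_effect(outcome_related_missing=True)   # understates 0.2
benign = estimated_effect(outcome_related_missing=False)  # close to 0.2
```

When dropout is unrelated to the outcome, only power suffers; when it is related, the surviving control group is no longer comparable and the estimate is biased.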
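The CACE parameter examined in the Schochet and Chiang report (NCEE 2009-4040) has a simple point-estimate form: the standard IV/Bloom-style estimator divides the intent-to-treat impact by the treatment-control difference in service-receipt rates. The numbers below are hypothetical.

```python
def cace_estimate(itt_impact, treatment_receipt, control_receipt=0.0):
    """Complier average causal effect: intent-to-treat impact divided by
    the difference in service-receipt rates between the treatment and
    control groups (the Bloom/IV adjustment)."""
    return itt_impact / (treatment_receipt - control_receipt)

# Hypothetical numbers: a 0.06 SD ITT impact, with 75 percent of the
# treatment group and 5 percent of the control group receiving services.
impact = cace_estimate(0.06, 0.75, 0.05)  # about 0.086 SD for compliers
```

The report's contribution is the variance side of this estimator for clustered designs; the point estimate above is the familiar starting place.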
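The variance-bias tradeoff at the heart of the late-pretest report (NCEE 2009-4033) reduces to comparing mean squared errors. A one-line sketch with hypothetical bias and variance values:

```python
def mse(bias, variance):
    """Mean squared error of an impact estimator: squared bias plus variance."""
    return bias ** 2 + variance

# Hypothetical numbers: including a late (post-assignment) pretest cuts
# the variance of the impact estimate sharply while introducing a small
# contamination bias.
exclude_pretest = mse(bias=0.00, variance=0.010)
include_pretest = mse(bias=0.02, variance=0.004)
# include_pretest < exclude_pretest: the precision gain outweighs the bias.
```

This is the shape of the report's conclusion: unless impacts grow quickly early in the year (making the contamination bias large), including the late pretest wins.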
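One widely used adjustment in the multiple-comparisons setting addressed by the Schochet guidelines (NCEE 2008-4018) is the Benjamini-Hochberg step-up procedure, which controls the false discovery rate across a family of tests. The sketch below is a generic implementation for illustration, not a procedure prescribed by the report; the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: returns the indices of
    hypotheses rejected while controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose p-value clears the step-up threshold q*rank/m.
        if p_values[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])

# Five outcome p-values (hypothetical). Naive testing at 0.05 would call
# three of them significant; the FDR procedure rejects only the first two.
rejected = benjamini_hochberg([0.001, 0.012, 0.044, 0.20, 0.61])
```

This illustrates the guidelines' core point: unadjusted testing across many outcomes and subgroups inflates the rate of spurious significant findings.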
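The sample-size multiplier in the RD power report (NCEE 2008-4026) stems from the collinearity between the treatment indicator and the assignment variable. The snippet below works a textbook special case, assuming a standard-normal assignment variable with the cutoff at its mean, where the design effect 1/(1 - rho^2) evaluates to about 2.75, broadly consistent with the report's three-to-four-times conclusion.

```python
import math

def rd_design_effect_normal():
    """Design effect for a sharp RD with a standard-normal assignment
    variable and cutoff at the mean: the sample must grow by
    1/(1 - rho^2), where rho = corr(treatment indicator, score)."""
    phi0 = 1.0 / math.sqrt(2 * math.pi)  # standard normal density at 0
    rho = phi0 / 0.5                     # corr(1{z > 0}, z); SD of 1{z>0} is 0.5
    return 1.0 / (1.0 - rho ** 2)

deff = rd_design_effect_normal()  # about 2.75x the experimental sample
```

Shifting the cutoff away from the mean, or clustering the design, pushes the multiplier higher, which is why the report's empirical range runs to roughly four.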