Technical Methods Report: Using State Tests in Education Experiments

NCEE 2009-013
November 2009

Conclusions and Recommendations

The appeal and use of statewide proficiency assessments as sources of outcome measures in education randomized control trials (RCTs) have grown in recent years. This discussion paper examined the issues, methods, and assumptions associated with the use of state test data in education experiments. An important theme emerging from this work is that researchers should carefully weigh numerous factors when deciding whether and how to use state test data in RCT evaluations. Such decisions typically have serious implications for the validity and precision of RCT results.

A number of recommendations pertaining to the design and conduct of RCTs flow from our discussions and should help guide researchers considering using state assessments as a source of outcome measures in their studies. They include the following:

1. Gauge the alignment of specific assessments with the outcome objectives of, and research questions about, the intervention of interest.
Arguably the most important first step in assessing the suitability of state tests is identifying the outcomes that the intervention is intended to affect. After defining the outcomes of interest, research teams should gauge the degree to which specific assessments are aligned with those objectives, focusing on the central research questions. If the intervention is expected to affect a relatively narrow and specific set of skills, it is important to gauge whether those skills are captured sufficiently well by the assessment, whether the scores reported include those that pertain to these specific skills, and whether the information across grades and/or states is consistent. If the intervention seeks to improve students' proficiency on state standards, then variation in test content appropriately represents the variation in proficiency goals and standards across grades and states. In other words, in this latter case, state assessments are aligned by definition and the decision to use state tests as a source of outcome measures and/or to combine results across grades and states seems easier to justify.

In evaluating program impacts, it is also important to identify not only the ultimate outcome objectives but also the intermediate outcomes and mechanisms through which the intervention might achieve such objectives. Elaborating this theory of action can help identify additional outcomes and/or processes that the study should measure. By estimating program impacts across a set of relevant outcomes, an RCT might provide information about potential variation in effects across different outcomes. It is nevertheless important to explicitly connect the outcomes examined to the intervention's theory of action, and to focus mainly on the primary outcomes the intervention intends to influence. When multiple outcome measures are examined, researchers also need to adjust significance levels to account for multiple comparisons.
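
As an illustration of the adjustment for multiple comparisons, the following Python sketch applies two widely used procedures to a set of p-values, one per outcome measure. The report does not prescribe a particular procedure; the Bonferroni and Benjamini-Hochberg methods, the function names, and the p-values in the usage note are assumptions chosen only for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each test that is significant at the Bonferroni-adjusted level alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag each test rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected
```

For the hypothetical p-values 0.001, 0.012, 0.025, 0.045, and 0.20, the Bonferroni rule flags only the first outcome, while the Benjamini-Hochberg rule flags the first three. The choice between controlling the familywise error rate and the false discovery rate is a design decision that should be stated in advance.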

2. Ensure that the assessment is reliable and appropriate for the study target population.
The power of an RCT to detect program effects is directly related to the reliability of the outcome measure used. It is essential that researchers select instruments that have demonstrated high reliability, producing test scores that are relatively free of random measurement error. It is important to note that a state test that has been shown to produce reliable scores for a statewide or national population might produce unreliable scores if that test is used with a sample of students who exhibit performance that is substantially above or below average. A test that is too easy (or too hard) for the study sample might produce many (near) perfect (or zero) scores, making it impossible to distinguish between the performance of many students and potentially masking program impacts. Other important considerations include content coverage, test format, high versus low stakes, participation rates, and testing accommodations or exceptions.
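
The link between reliability and power can be made concrete with a short Python sketch. It assumes the classical test theory result that random measurement error inflates the observed score variance, so a standardized effect measured on observed scores equals the effect on true scores multiplied by the square root of the reliability. The second function is a simple screen for floor and ceiling effects in a study sample. Both function names and all inputs are hypothetical.

```python
import math


def attenuated_effect_size(true_effect_size, reliability):
    """Standardized effect on observed scores, given the test's reliability.

    Assumes classical test theory: observed SD = true SD / sqrt(reliability).
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return true_effect_size * math.sqrt(reliability)


def share_at_bounds(scores, min_score, max_score):
    """Return the shares of students at the lowest and highest possible scores."""
    n = len(scores)
    floor_share = sum(s <= min_score for s in scores) / n
    ceiling_share = sum(s >= max_score for s in scores) / n
    return floor_share, ceiling_share
```

Under these assumptions, a true effect of 0.25 standard deviations measured with a reliability of 0.64 appears as an effect of about 0.20 standard deviations. A large share of scores at either bound suggests that the test may be too easy or too hard for the study sample.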

3. Whenever possible, collect and use baseline measures.
When outcome measures include state assessments, the added effort and/or cost to obtain data from prior years is usually well justified by the associated increases in power and other benefits of having baseline data available. The typically high correlations between waves of annual achievement test scores yield dramatic increases in the statistical power of an RCT study when prior outcome scores are included as covariates in statistical models of program impacts. Even if individual-level data are unavailable, aggregate school-level data might be easily obtained from school accountability reports for use as a covariate in studies in which schools represent the unit of random assignment. Baseline data are also useful in confirming the equivalence of treatment and comparison samples, and in examining the potential for nonresponse bias for impact estimates.
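
The gain in power from baseline covariates can be sketched with a common approximation to the minimum detectable effect size (MDES) for a design in which schools are randomly assigned. The formula below is a standard approximation from the power analysis literature, not one stated in this report. It assumes equal school sizes, a multiplier of 2.8 (roughly 80 percent power for a two-tailed test at the 0.05 level with many schools), and ignores the degrees of freedom lost to covariates. All parameter names are hypothetical.

```python
import math


def mdes_cluster(n_schools, students_per_school, icc,
                 r2_school=0.0, r2_student=0.0,
                 treated_share=0.5, multiplier=2.8):
    """Approximate MDES (in SD units) when schools are randomly assigned.

    icc        : share of outcome variance that lies between schools
    r2_school  : share of between-school variance explained by covariates
    r2_student : share of within-school variance explained by covariates
    """
    p = treated_share
    # Variance contribution from differences between schools.
    between = icc * (1 - r2_school) / (p * (1 - p) * n_schools)
    # Variance contribution from differences among students within schools.
    within = (1 - icc) * (1 - r2_student) / (
        p * (1 - p) * n_schools * students_per_school
    )
    return multiplier * math.sqrt(between + within)
```

With 40 schools, 60 students per school, and an intraclass correlation of 0.15 (all made-up inputs), this approximation gives an MDES of about 0.36 without covariates and about 0.20 when a school-level baseline measure explains 75 percent of the between-school variance. This is why even aggregate school-level prior scores can be valuable.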

4. Carefully consider whether and how to combine results based on distinct assessments.
To combine results based on distinct assessments across grades or states, researchers must demonstrate that several important conditions are met. First and foremost, differences in the assessments must be ignorable and must reflect expected variation in the definition of the outcomes targeted by the intervention (for example, students' ability to meet their state's proficiency standards). If the goal of the study is to demonstrate that an intervention has an effect on specific skills or knowledge, then combining results across grades or states might be inappropriate unless each test provides valid, reliable, and comparable data on the targeted outcomes.

If this first condition is met, combining results across states or grades may be appropriate and researchers can choose among several analytic approaches to combine results. The simplest and most powerful approach to produce combined impact estimates involves running multigrade or multistate analyses of pooled individual test scores, once these have been rescaled to z-scores or some other common metric. This approach, however, imposes strong assumptions on the data. First, the study sample from each grade and state must represent a similar cross-section of the overall population of students targeted by the intervention. Second, the distributions of scores from each test must be identical, except for differences in the means and standard deviations of the scale scores.
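
The rescaling step can be sketched in Python as follows. The sketch converts scale scores to z-scores separately for each test (for example, each state-by-grade combination) before pooling. By default it standardizes against the study sample's own mean and standard deviation; the optional `norms` argument, a hypothetical name, lets the analyst supply an external reference mean and standard deviation, such as statewide values, instead. The choice of reference distribution is an assumption the analyst must justify.

```python
from collections import defaultdict
from statistics import mean, pstdev


def to_z_scores(records, norms=None):
    """Convert (test_id, scale_score) pairs to (test_id, z_score) pairs.

    norms: optional dict mapping test_id to a (mean, sd) reference pair.
    Tests without an entry are standardized on the sample itself.
    """
    by_test = defaultdict(list)
    for test_id, score in records:
        by_test[test_id].append(score)

    stats = {}
    for test_id, scores in by_test.items():
        if norms and test_id in norms:
            stats[test_id] = norms[test_id]
        else:
            stats[test_id] = (mean(scores), pstdev(scores))
        if stats[test_id][1] == 0:
            raise ValueError(f"zero standard deviation for test {test_id!r}")

    return [
        (test_id, (score - stats[test_id][0]) / stats[test_id][1])
        for test_id, score in records
    ]
```

Rescaling aligns only the means and standard deviations of the tests. It does not make the score distributions identical in shape, which is the second assumption noted above and must be checked separately.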

As discussed in Chapter II, when these further conditions are met, results across grades or states may be combined using relatively straightforward analytic approaches. Because these determinations are subjective, however, researchers are responsible for explicitly making the case that variation in study samples does not preclude combining results.

An alternative to pooling rescaled student-level test scores—and our recommended approach to produce cross-grade or cross-state impact estimates—is to employ meta-analytic strategies. These strategies treat the estimates based on distinct assessments as estimates from separate, independent studies that jointly describe a distribution of treatment effects for the intervention of interest. While computationally more intensive, the strength of these approaches is that they can explicitly test whether the assumptions necessary to generate estimates of average treatment effects are met. Under the meta-analytic framework, the estimates for different grades or states may be combined using a weighted average or analyzed using a meta-regression model. Either approach seems appropriate for those RCTs that are conducted across states or grades and make use of state assessment data.
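
A minimal Python sketch of the weighted-average approach follows. It assumes that each grade or state contributes an impact estimate in a common effect-size metric together with its standard error, and it uses inverse-variance (fixed-effect) weights, one common choice among several. It also returns Cochran's Q statistic, a standard homogeneity statistic that can be compared with a chi-square distribution with the returned degrees of freedom. The function name and inputs are hypothetical.

```python
import math


def fixed_effect_summary(estimates, standard_errors):
    """Combine separate impact estimates with inverse-variance weights.

    Returns (pooled_estimate, pooled_se, q_statistic, degrees_of_freedom).
    """
    # More precise estimates (smaller standard errors) get larger weights.
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    return pooled, pooled_se, q, df
```

A Q statistic that is large relative to its degrees of freedom suggests that the effects differ across grades or states. In that case a single average may be a poor summary, and a random-effects or meta-regression model may be more appropriate.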