The appeal and use of statewide proficiency assessments as sources of outcome measures in education randomized controlled trials (RCTs) have grown in recent years. This discussion paper examined the issues, methods, and assumptions associated with the use of state test data in education experiments. An important theme emerging from this work is that researchers should carefully weigh numerous factors when deciding whether and how to use state test data in RCT evaluations. Such decisions typically have serious implications for the validity and precision of RCT results.
A number of recommendations pertaining to the design and conduct of RCTs flow from our discussions and should help guide researchers considering using state assessments as a source of outcome measures in their studies. They include the following:
1. Gauge the alignment of specific assessments with the outcome objectives of,
and research questions about, the intervention of interest.
Arguably the most important first step in assessing the suitability of state tests
is identifying the outcomes that the intervention is intended to affect. After defining
the outcomes of interest, research teams should gauge the degree to which specific
assessments are aligned with those objectives, focusing on the central research
questions. If the intervention is expected to affect a relatively narrow and specific
set of skills, it is important to gauge whether those skills are captured sufficiently
well by the assessment, whether the scores reported include those that pertain to
these specific skills, and whether the information across grades and/or states is
consistent. If the intervention seeks to improve students' proficiency on state
standards, then variation in test content appropriately represents the variation
in proficiency goals and standards across grades and states. In other words, in
this latter case, state assessments are aligned by definition and the decision to
use state tests as a source of outcome measures and/or to combine results across
grades and states seems easier to justify.
2. Ensure that the assessment is reliable and appropriate for the study target
population.
The power of an RCT to detect program effects is directly related to the reliability
of the outcome measure used. It is essential that researchers select instruments
that have demonstrated high reliability, producing test scores that are relatively
free of random measurement error. It is important to note that a state test that
has been shown to produce reliable scores for a statewide or national population
might produce unreliable scores if that test is used with a sample of students who
exhibit performance that is substantially above or below average. A test that is
too easy for the study sample might produce many perfect or near-perfect scores,
and a test that is too hard might produce many scores at or near zero. In either
case it becomes impossible to distinguish among the performance of many students,
and program impacts can be masked. Other important considerations
include content coverage, test format, high versus low stakes, participation rates,
and testing accommodations or exceptions.
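
As a rough illustration of the ceiling and floor concern raised above, the sketch below computes the share of a study sample scoring at or near the minimum and maximum possible scores. The function name `ceiling_floor_shares`, the `margin` parameter, the 0-50 score range, and the sample scores are all hypothetical; no particular threshold for "too many" is implied by the source.

```python
def ceiling_floor_shares(scores, min_score, max_score, margin=0):
    """Return (floor_share, ceiling_share): the fraction of scores lying
    within `margin` points of the lowest and highest possible scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    at_floor = sum(1 for s in scores if s <= min_score + margin)
    at_ceiling = sum(1 for s in scores if s >= max_score - margin)
    return at_floor / n, at_ceiling / n


# Hypothetical scores for a high-performing study sample on a 0-50 test.
sample_scores = [50, 48, 50, 47, 50, 49, 50, 45, 50, 50]
floor_share, ceiling_share = ceiling_floor_shares(
    sample_scores, min_score=0, max_score=50, margin=1
)
print(f"at or near floor: {floor_share:.0%}, at or near ceiling: {ceiling_share:.0%}")
```

A large share at either extreme suggests that the test may not discriminate well among students in the study sample, even if it is reliable for the broader population.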
3. Whenever possible, collect and use baseline measures.
When outcome measures include state assessments, the added effort and/or cost to
obtain data from prior years is usually well justified by the associated increases
in power and other benefits of having baseline data available. The typically high
correlations between waves of annual achievement test scores yield dramatic increases
in the statistical power of an RCT study when prior outcome scores are included
as covariates in statistical models of program impacts. Even if individual-level
data are unavailable, aggregate school-level data might be easily obtained from
school accountability reports for use as a covariate in studies in which schools
represent the unit of random assignment. Baseline data are also useful in confirming
the equivalence of treatment and comparison samples, and in examining the potential
for nonresponse bias for impact estimates.
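
The gain in power from a baseline covariate can be illustrated with a commonly used approximation for the minimum detectable effect size (MDES) in a trial that randomizes individual students. The sketch below is a simplification under stated assumptions: it uses a normal approximation, ignores clustering and degrees-of-freedom adjustments, and treats the sample size of 400 and the pretest-posttest correlation of 0.8 as purely illustrative values. The function name `mdes` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n, r_squared=0.0, p_treat=0.5, alpha=0.05, power=0.80):
    """Approximate MDES in standard deviation units for an individually
    randomized trial with n students, where r_squared is the share of
    outcome variance explained by baseline covariates."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    z = NormalDist()
    # Two-tailed test: critical value plus the z-value for the desired power.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sqrt((1 - r_squared) / (p_treat * (1 - p_treat) * n))


# Illustrative comparison: no covariate versus a pretest correlated 0.8
# with the outcome (so R-squared = 0.8 ** 2 = 0.64).
without_pretest = mdes(400)
with_pretest = mdes(400, r_squared=0.8 ** 2)
print(f"MDES without pretest: {without_pretest:.3f}")
print(f"MDES with pretest:    {with_pretest:.3f}")
```

Under this approximation the MDES shrinks by a factor of sqrt(1 - R-squared), so the stronger the correlation between baseline and outcome scores, the smaller the effect the study can detect with the same sample.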
4. Carefully consider whether and how to combine results based on distinct assessments.
To combine results based on distinct assessments across grades or states, researchers
must demonstrate that several important conditions are met. First and foremost,
differences in the assessments must be viewed as ignorable and reflecting expected
variation in the definition of outcomes targeted by the intervention (for example,
students' ability to meet their state's proficiency standards). If the goal of the study
is to demonstrate that an intervention has an effect on specific skills or knowledge,
then combining results across grades or states might be inappropriate unless each
test provides valid, reliable, and comparable data on the targeted outcomes.
If this first condition is met, combining results across states or grades may be appropriate and researchers can choose among several analytic approaches to combine results. The simplest and most powerful approach to produce combined impact estimates involves running multigrade or multistate analyses of pooled individual test scores, once these have been rescaled to z-scores or some other common metric. This approach, however, imposes strong assumptions on the data. First, the study sample from each grade and state must represent a similar cross-section of the overall population of students targeted by the intervention. Second, the distributions of scores from each test must be identical, except for differences in the means and standard deviations of the scale scores.
As discussed in Chapter II, when these further conditions are met, results across grades or states may be combined using relatively straightforward analytic approaches. Because these determinations are subjective, however, researchers are responsible for explicitly making the case that variation in study samples does not preclude combining results.
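
The rescaling step in the pooled approach can be sketched as follows. This is a minimal illustration, not the procedure used in any particular study: it standardizes each score against the mean and standard deviation of the study sample within its own state-by-grade group, whereas a study might instead standardize against the control group or the statewide distribution. The function name `rescale_to_z`, the group labels, and the scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rescale_to_z(records):
    """Convert (group, score) pairs to (group, z_score) pairs, standardizing
    each score against the mean and SD of its own group."""
    by_group = defaultdict(list)
    for group, score in records:
        by_group[group].append(score)

    group_stats = {}
    for group, scores in by_group.items():
        sd = pstdev(scores)  # assumption: SD of the study sample itself
        if sd == 0:
            raise ValueError(f"no score variation in group {group!r}")
        group_stats[group] = (mean(scores), sd)

    return [
        (group, (score - group_stats[group][0]) / group_stats[group][1])
        for group, score in records
    ]


# Hypothetical scale scores from two states whose tests use different metrics.
records = [
    ("State A, grade 4", 310), ("State A, grade 4", 350), ("State A, grade 4", 330),
    ("State B, grade 4", 48), ("State B, grade 4", 60), ("State B, grade 4", 54),
]
pooled = rescale_to_z(records)
for group, z in pooled:
    print(f"{group}: z = {z:+.2f}")
```

After rescaling, scores from every group have mean 0 and standard deviation 1 and can be stacked in a single impact model. The rescaling removes only differences in means and standard deviations, which is why the pooled analysis still depends on the two assumptions stated above.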
An alternative to pooling rescaled student-level test scores—and our recommended approach to produce cross-grade or cross-state impact estimates—is to employ meta-analytic strategies. These strategies treat the estimates based on distinct assessments as estimates from separate, independent studies that jointly describe a distribution of treatment effects for the intervention of interest. While computationally more intensive, the strength of these approaches is that they can explicitly test whether the assumptions necessary to generate estimates of average treatment effects are met. Under the meta-analytic framework, the estimates for different grades or states may be combined using a weighted average or analyzed using a meta-regression model. Either approach seems appropriate for those RCTs that are conducted across states or grades and make use of state assessment data.
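
One way to form the weighted average is a fixed-effect, inverse-variance combination, accompanied by Cochran's Q statistic as a check on whether the separate estimates are consistent with a single common effect. The sketch below is one possible implementation under those assumptions and is not drawn from the source; a random-effects model or a meta-regression would be needed if effects are expected to vary. The function name `pool_fixed_effect` and the three impact estimates and standard errors are hypothetical.

```python
from math import sqrt


def pool_fixed_effect(estimates, std_errors):
    """Combine separate impact estimates with inverse-variance weights.
    Returns (pooled_estimate, pooled_se, q_statistic, degrees_of_freedom)."""
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    weights = [1 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return pooled, pooled_se, q, len(estimates) - 1


# Hypothetical effect-size estimates (in SD units) from three states.
state_estimates = [0.10, 0.20, 0.15]
state_std_errors = [0.05, 0.05, 0.05]
pooled, pooled_se, q, df = pool_fixed_effect(state_estimates, state_std_errors)
print(f"pooled impact = {pooled:.3f} (SE {pooled_se:.3f}), Q = {q:.2f} on {df} df")
```

Under the hypothesis of a common effect, Q is approximately chi-square distributed with the reported degrees of freedom, so a large Q relative to that distribution signals heterogeneity that a simple weighted average would obscure.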