- Introduction
- Whether to use State Tests in Education Experiments
- How to use State Test Data in Education Experiments
- Conclusions and Recommendations
- References
- Appendix A: State Testing Programs Under NCLB
- Appendix B: How NCEE-Funded Evaluations Use State Test Data
- List of Tables

Baseline measures in an RCT can serve several purposes: they provide a mechanism for blocking or stratifying subjects, they can confirm the equivalence of the treatment and control groups prior to the intervention, and they can greatly increase statistical power when used as covariates in impact analyses. However, pretest-posttest and longitudinal designs that capitalize on the availability of baseline data raise other issues that can influence how state test data are analyzed. Underpinning many of these issues is the comparability of assessment data across grades and across time. The comparability (or incomparability) of tests can guide a researcher in choosing between the two primary ways to use baseline scores in an RCT: (1) combining them with follow-up scores to measure students' explicit gains in achievement (which might then be used as an outcome), or (2) using them as covariates to adjust statistically for baseline achievement in a regression or ANCOVA framework.

State assessments are generally designed to align with the state's curriculum and/or performance standards at each grade level. This goal is accomplished with varying success by different states (Rothman, Slattery, Vranek, and Resnick 2002). Furthermore, the clarity and quality of progression in the content of the state standards also varies across states (Finn, Petrilli, and Julian 2006; Schmidt, Wang, and McKnight 2005). The implication is that, although linking state test scores over time creates a multiyear assessment profile for each student, the achievement tests producing those scores change each year as students progress from one grade to the next.

In some states (for example, Florida), tests for adjacent grades are explicitly
linked through psychometric equating (Kolen and Brennan 2004). This so-called vertical
equating results in test scores that are on the same developmental scale across
multiple grades.^{16} Critics of vertical
equating nevertheless argue that the shift in content taught and tested at each
grade level makes it impossible to equate tests across several grades using a single
scale. Under this argument, any attempt to use test scores from adjacent grades
to measure absolute change is inherently flawed (Martineau 2006). If the goal is
to produce an explicit measure of change in achievement (for example, gain scores),
it is important to consider the similarity in what is being tested at each grade
level (Linn 1993). We take a pragmatic stance and argue that if the knowledge and
skills measured are consistent across grades, or if they exhibit a logical developmental
progression over multiple grades, then analysis of vertically scaled scores to produce
explicit measures of growth might be the best approach in terms of internal validity
and interpretability. Invariably, the selection of statistical models used to analyze
data from a multiyear study depends on the number of data points collected and whether
the scores are vertically scaled.

In the simplest multiyear study—the pretest-posttest design that involves two years of state test data—there are only two general approaches to analyzing these data: (1) covariance analyses or (2) analyses of difference scores.

*Covariance Analysis.* The more prevalent approach to analyzing pretest-posttest data
when test content differs across grades is covariance analysis.
Unlike the difference score approach, whose calculations might inappropriately imply
learning gains,^{17} the covariance analysis
treats the pretest as a control variable to be held constant when estimating group
differences on the outcome variable (Wildt and Ahtola 1978). The objective is not
to estimate a pre-post gain, but to control for differences on the pretest. In fact,
the pretest in covariance analyses need not be directly comparable to the posttest
from a statistical standpoint. Scores from a completely different pretest assessment
may work well as a covariate in the analysis of data from an RCT insofar as the
pretest scores explain variability in the outcome, thus increasing the statistical
power to detect program impacts.
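The mechanics of this approach can be sketched with simulated data (a minimal illustration, not the report's method; all numbers, seeds, and variable names below are invented): the impact estimate is the coefficient on the treatment indicator in a regression of the posttest on treatment status and the pretest covariate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
pretest = rng.normal(500, 50, n)             # baseline state test score
treat = rng.integers(0, 2, n).astype(float)  # random assignment
# Hypothetical data-generating process: 0.9 pretest slope, 5-point impact
posttest = 20 + 0.9 * pretest + 5.0 * treat + rng.normal(0, 25, n)

# ANCOVA as a regression: posttest on intercept, treatment, and pretest
X = np.column_stack([np.ones(n), treat, pretest])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
impact, pretest_slope = beta[1], beta[2]

# Residual SD with the pretest covariate vs. without it
resid_with = (posttest - X @ beta).std()
X0 = np.column_stack([np.ones(n), treat])
beta0, *_ = np.linalg.lstsq(X0, posttest, rcond=None)
resid_without = (posttest - X0 @ beta0).std()
```

Because assignment is random, the unadjusted difference in posttest means is also an unbiased impact estimate; the covariate's contribution is to shrink the residual standard deviation (`resid_with` versus `resid_without` above), and with it the standard error of the impact estimate.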

A general criticism of the covariance analysis approach is that it is prone to under-controlling,
because the pretest regression slope is underestimated due to unreliability in the
pretest scores (Sanders 2006).^{18} Fortunately, this criticism is not such a serious
issue in the context of RCTs, because random assignment usually eliminates the need
to adjust for preexisting differences. In the RCT context, the pretest is primarily
a mechanism to reduce error variance and increase statistical power (Shadish, Cook,
and Campbell 2002). However, the underadjustment of pretest scores in an RCT can
make interpretation of impact estimates less straightforward because group differences
are based on regression-adjusted posttest means (that is, residualized gain scores)
instead of simple pre-post difference scores. In addition, underadjustment due
to imperfect reliability of measured covariates can diminish the power to detect
effects (Holland and Rubin 1983).

*Difference Score Analysis.* The second approach, using difference scores
to analyze pretest-posttest data, involves calculating gains by subtracting each
student's pretest score from his or her posttest score. Aside from criticisms focusing
on the unreliability of difference scores (Cronbach and Furby 1970; see Rogosa and
Willett 1983 for a counterargument), a conservative perspective would suggest that
this is appropriate only when the tests from adjacent grades are vertically scaled
and clearly measure very similar content from one year to the next. On the other
hand, a different perspective might enable one to calculate difference scores even
when the tests are not vertically scaled and even when they measure different content.
After converting scores from each test to z-scores, the difference scores would
show differences in performance relative to the average student in standard deviation
units.

An important distinction between difference scores calculated using similar tests and those calculated using different tests is in how the difference scores are interpreted. Without vertical equating and similar content, the difference scores would not reflect differences in students' rates of learning. For example, subtracting a student's third-grade math score (which might focus mostly on whole numbers) from the student's fourth-grade score (which might focus mostly on fractions and decimals) will not necessarily reveal how much a student learned between the end of the third and fourth grades. To get that information, the student would have had to take a different pretest focusing primarily on fractions and decimals.

An alternative is to avoid interpreting difference scores as learning gains. Instead, the difference scores reflect only differences in relative performance from one year to the next. For example, a student might move from one standard deviation above the mean to 1.2 standard deviations above the mean. Although this change cannot be attributed solely to learning that occurred in the past year, such shifts in performance are equalized, on average, across treatment and control groups in an RCT. Therefore, any significant difference in the magnitude of such relative shifts in test scores can serve as an unbiased estimate of the impact of the intervention. For example, a significant positive difference between treatment and control groups from an analysis of z-score difference scores could be interpreted as implying that the average percentile ranking of subjects in the treatment group increased more over time than the average percentile ranking of subjects in the control group.
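A minimal sketch of the z-score difference approach on simulated data (all values and names are illustrative): each year's scores are standardized within their own test, so the two tests need not share a scale, and the treatment-control difference in z-score gains estimates the impact as a shift in relative standing, in standard deviation units.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
treat = rng.integers(0, 2, n).astype(float)
# Two tests on entirely different scales (no vertical equating)
grade3 = rng.normal(300, 40, n)
grade4 = 0.7 * (grade3 - 300) + 520 + 8.0 * treat + rng.normal(0, 30, n)

# Standardize each year's scores within its own test
z3 = (grade3 - grade3.mean()) / grade3.std()
z4 = (grade4 - grade4.mean()) / grade4.std()
diff = z4 - z3   # change in relative standing, in SD units

impact = diff[treat == 1].mean() - diff[treat == 0].mean()
```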

*Repeated Measures Analysis.* When more than two years of data are available,
statistical power can be increased further through the use of repeated measures
analyses or growth curve models (Allison, Allison, Faith, Paultre, and Pi-Sunyer
1997).^{19} Moreover, growth curve models might improve interpretability when the state
test is vertically scaled. This is because results from a growth curve analysis
can be benchmarked against the average achievement growth trajectory for a district
or a state, to determine the degree to which students are making or exceeding a
year's worth of learning gains as a result of an intervention.
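Growth curve analyses in practice use mixed-effects (multilevel) software, but the core idea can be sketched more simply with simulated, vertically scaled scores (a rough illustration under invented assumptions, not the report's method): estimate each student's annual growth rate from three time points and compare average growth across groups, with the control group's rate serving as the benchmark trajectory.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1500
years = np.array([0.0, 1.0, 2.0])            # three annual waves
treat = rng.integers(0, 2, n).astype(float)
start = rng.normal(400, 40, n)               # vertically scaled baseline
# Hypothetical growth: ~30 scale points/year, +4 points/year if treated
growth = 30 + 4.0 * treat + rng.normal(0, 5, n)
scores = start[:, None] + growth[:, None] * years + rng.normal(0, 10, (n, 3))

# Per-student OLS slope across the three time points
t_centered = years - years.mean()
slopes = (scores * t_centered).sum(axis=1) / (t_centered ** 2).sum()

control_growth = slopes[treat == 0].mean()   # benchmark yearly gain
impact_on_growth = slopes[treat == 1].mean() - slopes[treat == 0].mean()
```

Benchmarking `impact_on_growth` against `control_growth` is what permits statements such as "the intervention added roughly an extra tenth of a year's worth of learning per year."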

It can also be useful (and very inexpensive) to obtain baseline scores in other subjects for use in a repeated measures model. These additional variables can more fully account for treatment-control differences in baseline achievement and explain additional outcome variation, further increasing statistical power. This can be achieved by including prior achievement measures in these other subjects as covariates and/or by modeling the covariance structure of the residuals for the multiple outcomes in the repeated measures model.
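The first option can be sketched on simulated data (variable names and values are invented for illustration): an other-subject baseline covariate absorbs additional outcome variance beyond the same-subject pretest, shrinking the residual standard deviation that drives the impact estimate's standard error.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
ability = rng.normal(0, 1, n)                  # latent common factor
treat = rng.integers(0, 2, n).astype(float)
math_pre = ability + rng.normal(0, 0.7, n)     # same-subject baseline
read_pre = ability + rng.normal(0, 0.7, n)     # other-subject baseline
math_post = ability + 0.2 * treat + rng.normal(0, 0.5, n)

def resid_sd(covariates):
    """Residual SD of the math outcome after regressing on treatment
    plus the given baseline covariates."""
    X = np.column_stack([np.ones(n), treat] + covariates)
    beta, *_ = np.linalg.lstsq(X, math_post, rcond=None)
    return (math_post - X @ beta).std()

sd_one = resid_sd([math_pre])                  # math pretest only
sd_two = resid_sd([math_pre, read_pre])        # add reading pretest
```

Because both pretests are noisy measures of the same underlying achievement, the second covariate recovers variance the first one missed, so `sd_two` falls below `sd_one`.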