Technical Methods Report: Using State Tests in Education Experiments

NCEE 2009-013
November 2009

How to Use Baseline Measures

Baseline measures in an RCT can serve as a mechanism for blocking or stratifying subjects, can confirm the equivalence of the treatment and control groups prior to an intervention, and can greatly increase statistical power when used as covariates in impact analyses. However, pretest-posttest and longitudinal designs that capitalize on the availability of baseline data raise other issues that can influence how state test data are analyzed. Underpinning many of these issues is the comparability of assessment data across grades and across time. The comparability (or incomparability) of tests can guide a researcher's choice between two primary ways of using baseline scores in an RCT: (1) combined with follow-up scores to measure students' explicit gains in achievement (which might then serve as the outcome); or (2) as a covariate that statistically adjusts for baseline achievement in a regression or ANCOVA framework.

State assessments are generally designed to align with the state's curriculum and/or performance standards at each grade level. This goal is accomplished with varying success by different states (Rothman, Slattery, Vranek, and Resnick 2002). Furthermore, the clarity and quality of progression in the content of the state standards also varies across states (Finn, Petrilli, and Julian 2006; Schmidt, Wang, and McKnight 2005). The implication is that, although linking state test scores over time creates a multiyear assessment profile for each student, the achievement tests providing the scores change each year as students progress from one grade to the next.

In some states (for example, Florida), tests for adjacent grades are explicitly linked through psychometric equating (Kolen and Brennan 2004). This so-called vertical equating puts test scores on the same developmental scale across multiple grades.16 Critics of vertical equating argue, however, that the shift in content taught and tested at each grade level makes it impossible to equate tests across several grades using a single scale; under this argument, any attempt to use test scores from adjacent grades to measure absolute change is inherently flawed (Martineau 2006). If the goal is to produce an explicit measure of change in achievement (for example, gain scores), it is important to consider the similarity of what is tested at each grade level (Linn 1993). We take a pragmatic stance and argue that if the knowledge and skills measured are consistent across grades, or if they exhibit a logical developmental progression over multiple grades, then analyzing vertically scaled scores to produce explicit measures of growth might be the best approach in terms of internal validity and interpretability. In any case, the selection of statistical models for analyzing data from a multiyear study depends on the number of data points collected and on whether the scores are vertically scaled.

In the simplest multiyear study—the pretest-posttest design that involves two years of state test data—there are only two general approaches to analyzing these data: (1) covariance analyses or (2) analyses of difference scores.

Covariance Analysis. The more prevalent approach to analyzing pretest-posttest data when test content differs across grades involves the use of covariance analysis. Unlike the difference score approach, whose calculations might inappropriately imply learning gains,17 the covariance analysis treats the pretest as a control variable to be held constant when estimating group differences on the outcome variable (Wildt and Ahtola 1978). The objective is not to estimate a pre-post gain but to control for differences on the pretest. In fact, the pretest in a covariance analysis need not be directly comparable to the posttest from a statistical standpoint. Scores from a completely different pretest assessment can work well as a covariate in the analysis of data from an RCT insofar as the pretest data explain variability in the outcome, thus increasing statistical power to detect program impacts.
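
To make this concrete, the following minimal sketch fits a covariance model in Python with the pandas and statsmodels packages (the file name and column names are hypothetical, and the report does not prescribe any particular software). The coefficient on the treatment indicator is the regression-adjusted impact estimate; a full analysis would typically also include blocking indicators and standard errors that account for clustering at the unit of random assignment.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical student-level file with one row per student:
    #   posttest  - follow-up state test score (the outcome)
    #   pretest   - prior-year state test score (the covariate)
    #   treatment - 1 = assigned to treatment, 0 = control
    df = pd.read_csv("student_scores.csv")

    # Covariance (ANCOVA) model: the pretest is held constant rather than
    # subtracted from the posttest, so the two scores need not share a scale.
    ancova = smf.ols("posttest ~ treatment + pretest", data=df).fit()

    print(ancova.params["treatment"])  # regression-adjusted impact estimate
    print(ancova.bse["treatment"])     # its standard error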

A general criticism of the covariance analysis approach is that it is prone to undercontrolling: the pretest regression slope is underestimated because of unreliability in the pretest scores (Sanders 2006).18 Fortunately, this criticism is less serious in the context of RCTs, because random assignment generally eliminates the need to adjust for preexisting differences; in the RCT context, the pretest is primarily a mechanism to reduce error variance and increase statistical power (Shadish, Cook, and Campbell 2002). However, the underadjustment of pretest scores in an RCT can make interpretation of impact estimates less straightforward because group differences are based on regression-adjusted posttest means (that is, residualized gain scores) rather than simple pre-post difference scores. In addition, underadjustment due to imperfect reliability of measured covariates can diminish power to detect effects (Holland and Rubin 1983).
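
The attenuation described in footnote 18 can be illustrated with a short simulation, sketched below under arbitrary assumed values (a true pretest slope of 0.8 and a pretest reliability of 0.8): the estimated slope on the error-contaminated pretest shrinks toward zero by a factor equal to the reliability.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100000
    true_pre = rng.normal(size=n)                          # latent pretest ability (variance 1)
    post = 0.8 * true_pre + rng.normal(scale=0.6, size=n)  # posttest generated with a true slope of 0.8

    reliability = 0.8
    # Add measurement error so that Var(true)/Var(observed) equals the target reliability
    error_sd = np.sqrt((1 - reliability) / reliability)
    observed_pre = true_pre + rng.normal(scale=error_sd, size=n)

    slope = np.polyfit(observed_pre, post, 1)[0]
    print(slope)  # approximately 0.8 * 0.8 = 0.64, attenuated from the true slope of 0.8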

Difference Score Analysis. The second approach, using difference scores to analyze pretest-posttest data, involves calculating gains by subtracting each student's pretest score from his or her posttest score. Aside from criticisms focusing on the unreliability of difference scores (Cronbach and Furby 1970; see Rogosa and Willett 1983 for a counterargument), a conservative perspective would hold that this approach is appropriate only when the tests from adjacent grades are vertically scaled and clearly measure very similar content from one year to the next. A more permissive perspective would allow difference scores to be calculated even when the tests are not vertically scaled and even when they measure different content. After the scores from each test are converted to z-scores, the difference scores show changes in performance relative to the average student, expressed in standard deviation units.
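
A minimal sketch of the z-score version of this approach appears below (file and column names are hypothetical; in practice, scores might be standardized against the statewide distribution for each grade and year rather than against the analytic sample).

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("student_scores.csv")  # hypothetical columns: pretest, posttest, treatment

    # Standardize each year's scores separately, since the two tests share no common scale
    df["z_pre"] = (df["pretest"] - df["pretest"].mean()) / df["pretest"].std()
    df["z_post"] = (df["posttest"] - df["posttest"].mean()) / df["posttest"].std()

    # Difference score: change in relative standing, in standard deviation units
    df["z_gain"] = df["z_post"] - df["z_pre"]

    # Impact estimate: treatment-control difference in the average relative shift
    impact = smf.ols("z_gain ~ treatment", data=df).fit()
    print(impact.params["treatment"])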

An important distinction between difference scores calculated from similar tests and those calculated from different tests lies in how the difference scores are interpreted. Without vertical equating and similar content, the difference scores do not reflect differences in students' rates of learning. For example, subtracting a student's third-grade math score (which might focus mostly on whole numbers) from the student's fourth-grade score (which might focus mostly on fractions and decimals) will not necessarily reveal how much the student learned between the end of third grade and the end of fourth grade. To get that information, the student would have had to take a different pretest, one focusing primarily on fractions and decimals.

An alternative is to avoid interpreting difference scores as learning gains. Instead, the difference scores reflect only differences in relative performance from one year to the next. For example, a student might move from one standard deviation above the mean to 1.2 standard deviations above the mean. Although this change cannot be attributed solely to learning that occurred in the past year, such shifts in performance are equalized, on average, across treatment and control groups in an RCT. Therefore, any significant difference in the magnitude of such relative shifts in test scores serves as an unbiased estimate of the impact of the intervention. For example, a significant positive difference between treatment and control groups in an analysis of z-score difference scores could be interpreted as implying that the average percentile ranking of subjects in the treatment group increased more over time than the average percentile ranking of subjects in the control group.

Repeated Measures Analysis. When more than two years of data are available, statistical power can be increased further through the use of repeated measures analyses or growth curve models (Allison, Allison, Faith, Paultre, and Pi-Sunyer 1997).19 Moreover, growth curve models might improve interpretability when the state test is vertically scaled. This is because results from a growth curve analysis can be benchmarked against the average achievement growth trajectory for a district or a state, to determine the degree to which students are making or exceeding a year's worth of learning gains as a result of an intervention.
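
As an illustration, a simple growth curve model of this kind could be specified as follows, using a hypothetical long-format file (one row per student per year) and the statsmodels mixed-model routine; the year-by-treatment coefficient captures the impact on the annual growth rate.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format file: student_id, year (0, 1, 2, ...),
    # score (vertically scaled state test score), treatment (0/1)
    long = pd.read_csv("student_scores_long.csv")

    # Growth curve model with random intercepts and random year slopes for students
    growth = smf.mixedlm("score ~ year * treatment", data=long,
                         groups="student_id", re_formula="~year").fit()

    # Impact on the growth rate; when scores are vertically scaled, this can be
    # benchmarked against the average district or state growth trajectory
    print(growth.params["year:treatment"])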

It can also be useful (and very inexpensive) to obtain baseline scores in other subjects for use in a repeated measures model. These additional variables can more fully account for treatment-control differences in baseline achievement and can explain additional outcome variation, resulting in further increases in statistical power. This can be achieved by including prior achievement measures in these other subjects as covariates and/or by modeling the covariance structure of the residuals for the multiple outcomes in the repeated measures model, as sketched below.
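
The sketch extends the growth model above, assuming the hypothetical long-format file also carries baseline_reading and baseline_science columns; it implements only the first option (other-subject baselines as covariates), because modeling the residual covariance structure across multiple outcomes would require a multivariate specification not shown here.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format file as above, plus baseline scores in other subjects
    long = pd.read_csv("student_scores_long.csv")

    # Other-subject baselines entered as student-level covariates in the growth model
    growth = smf.mixedlm("score ~ year * treatment + baseline_reading + baseline_science",
                         data=long, groups="student_id", re_formula="~year").fit()
    print(growth.params["year:treatment"])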


16 Theoretically, vertically equated tests would produce the same score regardless of which version of the test is taken (Kolen and Brennan 2004; Holland and Dorans 2006). In other words, a fourth grader's score should remain the same, even if she took the fifth-grade version of the vertically scaled test.
17 We use the term "difference score" here to signify a simple subtraction of the pretest score from the posttest score. Because state tests are seldom equated from one grade to the next, subtraction of these two scores might have little or no interpretation. In the case in which the tests are scaled to have the same mean score in every grade, the expected difference for the average student is zero. Thus, difference scores for tests that do not have a vertical scale cannot be used to reflect absolute annual learning gains.
18 In regression and covariance models, parameter estimates for any predictor variable measured with error will be attenuated toward zero by an amount equal to one minus the reliability of that predictor (see Neter, Kutner, Nachtsheim, and Wasserman 1996, p. 164). Because achievement test scores always have less-than-perfect reliability, the slope estimate for the pretest covariate will be attenuated, resulting in underadjustment of pretest scores.
19 When these models are estimated in an HLM or mixed model framework, they have the additional advantage of handling missing data without the need for listwise deletion or imputation (Raudenbush and Bryk 2002; Singer and Willett 2003). In both the difference score models and the covariance models, a missing pretest or posttest score means that the student is excluded from the analysis unless alternative methods for handling missing data, such as multiple imputation, are used. Repeated measures and growth curve models enable students with missing data to be included in the analysis, under the assumption that their data are missing at random (MAR) (see Rubin 1987 for a definition of MAR).
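
The practical difference described in this footnote is visible in how the analysis file is prepared. The sketch below (with hypothetical file and column names) shows that the pre-post analyses require complete cases in a wide file, whereas the repeated measures models use a long file in which a student with one missing wave still contributes whatever waves were observed.

    import pandas as pd

    # Hypothetical wide file: student_id, treatment, pretest, posttest
    wide = pd.read_csv("student_scores_wide.csv")

    # Difference-score and covariance analyses: students missing either score
    # are dropped (listwise deletion) unless the missing scores are imputed
    complete_cases = wide.dropna(subset=["pretest", "posttest"])

    # Repeated measures / growth curve analyses: reshape to long format and keep
    # every observed wave, relying on the missing-at-random (MAR) assumption
    long = wide.melt(id_vars=["student_id", "treatment"],
                     value_vars=["pretest", "posttest"],
                     var_name="wave", value_name="score").dropna(subset=["score"])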