Whether and How to Use State Tests to Measure Student Achievement in a Multi-State Randomized Experiment: An Empirical Assessment Based on Four Recent Evaluations

An important question for educational evaluators is how best to measure academic achievement, the outcome of primary interest in many studies. In large-scale evaluations, student achievement has typically been measured by administering a common standardized test to all students in the study (a "study-administered test"). In the era of No Child Left Behind (NCLB), however, state assessments have become an increasingly viable source of information on student achievement. Using state test scores can yield substantial cost savings for the study and can eliminate the burden of additional testing on students and teaching staff. On the other hand, state tests can also pose certain difficulties: their content may not be well aligned with the outcomes targeted by the intervention, and variation in the content and scale of the tests can complicate pooling scores across states and grades.

This NCEE Reference Report, Whether and How to Use State Tests to Measure Student Achievement in a Multi-State Randomized Experiment: An Empirical Assessment Based on Four Recent Evaluations, examines the sensitivity of impact findings to (1) the type of assessment used to measure achievement (state tests or a study-administered test); and (2) analytical decisions about how to pool state test data across states and grades. These questions are examined using data from four recent IES-funded experimental design studies that measured student achievement using both state tests and a study-administered test. Each study spans multiple states and two of the studies span several grade levels.

Based on these four studies, the authors conclude that:

State tests provide a useful complement to a study-administered test, because they are policy-relevant measures of general achievement. However, in certain cases, state tests may not be a feasible substitute for a study-administered test, either because state tests are not administered in all relevant grades or because the primary outcome is a specific skill that is not measured by all states' tests.
Inferences about program impacts are not sensitive to decisions about how test scores are scaled for the purposes of pooling the results across states or grades (for example, whether traditional or rank-based z-scores are used and whether z-scores are based on the sample or state distribution of scores).
The most appropriate method for aggregating the impact findings across states or grades is to use fixed-effects (precision) weighting, because the conditions for using random-effects weighting are not met.
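
To make the scaling and pooling choices named in these conclusions concrete, the following is a minimal Python sketch, not the report's own code. It assumes each state's impact is a simple treatment-control difference in mean z-scores, which is likely simpler than the estimators used in the four evaluations. All function names (traditional_z, rank_based_z, impact_estimate, precision_weighted) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist, mean, pstdev, variance


def traditional_z(scores, ref_mean=None, ref_sd=None):
    """Traditional z-scores.

    By default the sample's own mean and SD are used. Passing ref_mean and
    ref_sd (for example, a state-wide mean and SD) scales the scores against
    the state distribution instead.
    """
    mu = mean(scores) if ref_mean is None else ref_mean
    sd = pstdev(scores) if ref_sd is None else ref_sd
    return [(x - mu) / sd for x in scores]


def rank_based_z(scores):
    """Rank-based z-scores.

    Each score is converted to a percentile from its average rank (ties share
    a midrank), then mapped through the inverse standard normal CDF.
    """
    n = len(scores)
    normal = NormalDist()
    z = []
    for x in scores:
        below = sum(1 for s in scores if s < x)
        equal = sum(1 for s in scores if s == x)
        midrank = below + (equal + 1) / 2  # 1-based average rank
        z.append(normal.inv_cdf((midrank - 0.5) / n))
    return z


def impact_estimate(treated, control):
    """Difference in means and its sampling variance (illustrative assumption)."""
    diff = mean(treated) - mean(control)
    var = variance(treated) / len(treated) + variance(control) / len(control)
    return diff, var


def precision_weighted(estimates):
    """Pool (impact, variance) pairs with fixed-effects (precision) weights.

    Each estimate is weighted by the inverse of its variance, so more
    precisely estimated states or grades count for more.
    Returns the pooled impact and its standard error.
    """
    weights = [1 / v for _, v in estimates]
    total = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return pooled, (1 / total) ** 0.5
```

In this sketch, scores would be converted to z-scores within each state (or state-by-grade cell), impact_estimate would be applied to each cell, and the resulting pairs would be passed to precision_weighted. Swapping traditional_z for rank_based_z, or sample moments for state moments, is the kind of scaling decision the second conclusion describes.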
