Technical Methods Report: Using State Tests in Education Experiments

NCEE 2009-013
November 2009

Whether to Use State Tests in Education Experiments

Researchers must weigh numerous issues when deciding whether to use state assessment data in a randomized controlled trial (RCT). Nearly all of these issues concern either the suitability or the feasibility of using state assessments. Suitability issues relate to whether the state assessment(s) will provide accurate and useful information about the effects of an intervention.1 Feasibility issues focus on the practical aspects of obtaining access to the necessary state test data.

Our identification and discussion of issues associated with the suitability of state assessments in evaluation research is guided by two key concepts from basic measurement theory. The first concept is validity, which we define as the degree to which the state assessment adequately measures the outcomes targeted by the intervention. Validity issues include the relevance and appropriateness of the assessments for the intervention and its target population. The second concept is reliability, which we define as the degree to which the state assessment provides scores that are sufficiently free of random measurement error to detect program effects.2 Reliability issues concern the precision (and interpretation) of the scores produced and their sensitivity for detecting differences among groups and changes over time.
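As a rough sketch (our own illustration, not part of the report), classical test theory formalizes this notion of reliability by decomposing an observed score X into a true score T plus random error E, so that reliability is the share of observed-score variance attributable to true scores, Var(T)/Var(X). A short simulation with hypothetical variance components:

```python
import numpy as np

# Illustrative sketch with hypothetical numbers: an observed test score
# X is modeled as a true score T plus independent random error E.
rng = np.random.default_rng(0)

n = 100_000
true_var, error_var = 80.0, 20.0               # hypothetical variance components
T = rng.normal(500.0, np.sqrt(true_var), n)    # "true" achievement
E = rng.normal(0.0, np.sqrt(error_var), n)     # random measurement error
X = T + E                                      # observed state-test score

reliability = true_var / (true_var + error_var)  # theoretical value: 0.80
estimate = np.var(T) / np.var(X)                 # simulation-based check

print(f"theoretical reliability: {reliability:.2f}")
print(f"estimated reliability:   {estimate:.2f}")
```

With these hypothetical components, four-fifths of the score variance reflects true achievement and one-fifth reflects noise; a noisier test (larger error variance) would have lower reliability and, as discussed below, less power to detect program effects.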

Our definitions of reliability and validity align closely with definitions from the psychometric literature (see AERA, APA, and NCME 1999; Lord and Novick 1968), while focusing explicitly on the suitability of state assessments as an outcome measure in randomized evaluations. Note also that the two concepts are interdependent: a measure cannot be valid without sufficient reliability (an unreliable score is too noisy to be a valid measure of anything), and a reliable measure is of little use if it is not a valid indicator of the outcome of interest.3
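This interdependence can be stated formally. A standard result from classical test theory (the symbols here are ours, not the report's) bounds the correlation between observed scores $X$ and any criterion $Y$ by the square root of the reliability of $X$:

```latex
\rho_{XY} \;\le\; \sqrt{\rho_{XX'}}
```

where $\rho_{XX'}$ denotes the reliability coefficient of $X$. In this sense, low reliability places a ceiling on how valid a measure can possibly be.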

In the sections that follow, we discuss issues related to assessing the validity and reliability of state test scores as they relate to randomized evaluations of program effects. We also discuss feasibility issues, which concern the task of gaining access to state test scores and the ability to link individual-level data from one year to the next.

1 In judging the suitability of any assessment, it is clearly important to review the technical details for the test(s) under consideration. Researchers intending to use a state assessment in an evaluation should obtain the technical manual or report published by the test developer or the state department of education. In many cases, states have published these reports on their websites. In other cases, a researcher might need to contact the office of assessment within the state department of education to obtain the technical manual.
2 Reliability is defined strictly in terms of random measurement error as a component of total variability in test scores. This definition does not take into consideration systematic error that would result in measurement bias. Although the presence of random measurement error can detrimentally affect the statistical power of a treatment-control comparison, its presence would not produce a systematic under- or overestimation of the treatment effect (that is, bias). On the other hand, measurement bias that operates differently for treatment and control groups (for example, treatment group scores biased upwards due to a Hawthorne effect) might result in biased impact estimates. Such measurement bias threatens the validity of an instrument for use as an outcome measure in an experiment (see section on "Assessing the Validity of State Assessments for Evaluation Purposes").
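The claim in this footnote, that random measurement error widens the sampling variability of the impact estimate (hurting power) without biasing it, can be illustrated with a small simulation. This is our own hedged sketch with hypothetical parameter values, not an analysis from the report:

```python
import numpy as np

# Hypothetical setup: a true treatment effect of 5 score points, outcome
# noise with SD 10, and (for the "noisy" condition) additional random
# measurement error with SD 10 layered on the observed outcome.
rng = np.random.default_rng(1)

n, true_effect, n_reps = 200, 5.0, 2000
effects_clean, effects_noisy = [], []
for _ in range(n_reps):
    treat = rng.integers(0, 2, n)                       # random assignment
    outcome = 500.0 + true_effect * treat + rng.normal(0, 10, n)
    noisy = outcome + rng.normal(0, 10, n)              # add measurement error
    effects_clean.append(outcome[treat == 1].mean() - outcome[treat == 0].mean())
    effects_noisy.append(noisy[treat == 1].mean() - noisy[treat == 0].mean())

# Both estimators center on the true effect (no bias) ...
print(np.mean(effects_clean), np.mean(effects_noisy))
# ... but the noisy outcome has larger sampling variability (less power).
print(np.std(effects_clean), np.std(effects_noisy))
```

Across repeated randomizations, both the clean and the error-laden outcomes yield impact estimates that average to the true effect, but the standard deviation of the estimates is noticeably larger under measurement error, which is exactly the power loss the footnote describes.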
3 Some researchers may be accustomed to thinking of suitability issues in terms of the statistical concepts of precision and accuracy. Precision refers to a relative lack of error variance and is largely redundant with the measurement concept of reliability. Accuracy refers to a relative lack of bias (i.e., the correct answer is the most likely result), which is a necessary but not sufficient condition for validity—both accuracy and precision are required for a test to be valid.