Researchers have numerous issues to consider when deciding whether to use state assessment data in an RCT. Nearly all of these issues concern either the suitability or the feasibility of using state assessments. Suitability issues relate to whether the state assessment(s) will provide accurate and useful information about the effects of an intervention.1 Feasibility issues focus on the practical aspects of obtaining access to the necessary state test data.
Our identification and discussion of issues associated with the suitability of state assessments in evaluation research is guided by two key concepts from basic measurement theory. The first concept is validity, which we define as the degree to which the state assessment adequately measures the outcomes targeted by the intervention. Validity issues include considerations about the relevance and appropriateness of the assessments for the intervention and its target population. The second concept is reliability, which we define as the degree to which the state assessment provides scores that are sufficiently free of random measurement error that they can be used to detect program effects.2 Reliability issues concern the precision (and interpretation) of the scores produced and their sensitivity for detecting differences among groups and changes over time.
Our definitions of reliability and validity align closely with definitions from the psychometric literature (see AERA, APA, and NCME 1999; Lord and Novick 1968), while focusing explicitly on the suitability of state assessments as an outcome measure in randomized evaluations. Note also that the two concepts are interdependent: a measure cannot be valid without sufficient reliability (that is, an unreliable score is too noisy to be a valid measure of anything), and a reliable measure is of little use if it is not a valid indicator of the outcome of interest.3
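The interdependence of the two concepts can be illustrated with the standard classical test theory model (as in Lord and Novick 1968). The sketch below is not part of the original discussion. The symbols $X$, $T$, $E$, $Y$, $\rho_{XX'}$, and $\Delta$ are notation introduced here for illustration, and the last line assumes that measurement error is unrelated to treatment status and that the intervention shifts true scores by a constant amount $\Delta$.

```latex
% Observed score = true score + random measurement error
X = T + E, \qquad \operatorname{Cov}(T, E) = 0
\quad\Longrightarrow\quad \sigma_X^2 = \sigma_T^2 + \sigma_E^2

% Reliability: share of observed-score variance due to true scores
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}
           = 1 - \frac{\sigma_E^2}{\sigma_X^2}

% Reliability caps validity: the correlation of X with any
% outcome of interest Y cannot exceed the square root of reliability
\lvert \rho_{XY} \rvert \le \sqrt{\rho_{XX'}}

% Measurement error attenuates the standardized effect size
\frac{\Delta}{\sigma_X} = \frac{\Delta}{\sigma_T}\,\sqrt{\rho_{XX'}}
```

Under these assumptions, a test with reliability 0.81 would show a standardized effect that is 0.9 times the effect on true scores, which is one way random measurement error reduces sensitivity for detecting program effects.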
In the sections that follow, we discuss issues related to assessing the validity and reliability of state test scores as they relate to randomized evaluations of program effects. We also discuss feasibility issues, which concern gaining access to state test scores and linking individual-level data from one year to the next.