Technical Methods Report: Using State Tests in Education Experiments

NCEE 2009-013
November 2009

Assessing the Validity of State Assessments for Evaluation Purposes

Perhaps the most important first step when evaluating the utility of state assessment data in an RCT is to identify the outcomes specified in the research questions. If the research questions focus on specific skills that are theorized to be influenced by the intervention, then the state assessment will be useful to the extent that it measures those specific skills. On the other hand, if the research questions focus on the ability of the intervention to improve overall performance on the state assessment, then the skills and knowledge measured by the state test comprise the de facto outcomes targeted by the intervention.

In evaluations in which the research questions focus on specific skills, the alignment of the state test can be established by determining the proportion of test items that measure skills and knowledge targeted by the intervention.4 In evaluations in which the research questions focus on overall achievement or proficiency as defined by the state test, it would be important to justify the expectation that the intervention can have a significant impact on such broad measures of student performance.

Subject Area and Test Domain Alignment. A more detailed examination of the alignment between the assessment and the intervention might be the next crucial consideration in establishing the validity of a state assessment as a source of outcome measures in an RCT. Estimated impacts are typically largest when the outcome measure aligns closely with the outcomes targeted by the intervention and smaller when the two are not closely aligned. In other words, the largest impact estimate would be expected when the test measures those aspects of student performance that the intervention is designed to affect.

The most obvious aspect of alignment concerns the subjects tested and the domains tested within each subject. Consider, for example, a study in which researchers are considering using the scores from a third-grade state assessment in English Language Arts (ELA) to evaluate the impacts of an intervention that relies heavily on techniques of guided and shared reading in order to develop students' fluency and comprehension. The theory of action behind the intervention is that, in order for students to comprehend what they read, they must be able to read fluently. Hence, the intervention seeks to improve comprehension primarily through improvements in fluency. The state assessment uses multiple-choice items exclusively to measure ELA achievement. It produces an overall score and two subscale scores: (1) reading comprehension and (2) vocabulary. Without an additional measure of reading fluency, the evaluation risks failing to detect a program effect, because the state test will not provide valid information about the primary outcome targeted by the intervention—reading fluency.

This lack of alignment is even more obvious when the intervention of interest focuses on a subject that is not included in the battery of state assessments. This is a common problem for interventions in early literacy, social studies, science, and most high school subjects because a state assessment might not exist for the targeted subject and grade. In cases in which a sufficiently aligned test simply does not exist, the use of state assessments for RCT outcome measures might not be a viable option.

There are, nevertheless, instances in which the state test could be used despite imperfect alignment with the intervention. If the goal of the intervention is to improve reading or math skills in the context of other classes (for example, social studies or science), it might be defensible to use the state reading or math test scores that reflect general (rather than subject-specific) skills as an outcome measure. In the reading intervention example above, although the primary targeted outcome (reading fluency) was not captured by the state test, a secondary outcome (reading comprehension) was captured. It might therefore be argued that the reading comprehension scores can still serve as a useful outcome measure. However, the study's power for detecting a treatment effect might be lower if the intervention's effect on reading comprehension is smaller than its effect on fluency. Perhaps the best situation involves multiple measures, including both direct and indirect outcomes as specified in the intervention's theory of action. Analyses would produce impact estimates for multiple outcomes (with appropriate adjustments for multiple statistical tests), thus providing a comprehensive test of the program's theory of action.
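The power concern raised here can be made concrete with a back-of-the-envelope sketch. The following uses a standard two-sample normal approximation to show how a smaller effect on the secondary outcome (comprehension) translates into lower power than the same design achieves for the primary outcome (fluency). The effect sizes and sample size are hypothetical illustrations, not values from any study discussed in this report.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_sample(effect_size: float, n_per_group: int) -> float:
    """Approximate power of a two-sided, alpha = .05 two-sample z-test
    for a standardized mean difference (Cohen's d)."""
    z_crit = 1.96  # critical value for two-sided alpha = .05
    noncentrality = effect_size * sqrt(n_per_group / 2.0)
    return 1.0 - normal_cdf(z_crit - noncentrality)

# Hypothetical effect sizes: d = 0.30 on fluency (the primary outcome)
# versus d = 0.15 on comprehension (the secondary outcome captured by
# the state test), with 200 students per group.
print(round(power_two_sample(0.30, 200), 2))  # 0.85
print(round(power_two_sample(0.15, 200), 2))  # 0.32
```

Halving the effect size more than halves the power in this example, which is why relying on a less-aligned secondary outcome can leave a study underpowered even when the sample is adequate for the primary outcome.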

Last, a well-aligned state test that is administered more than a year after the intervention might still provide value in the context of an RCT. This situation is quite likely for high school interventions, in which the state assessment is typically administered in only one grade or as end-of-course tests. For example, May and Robinson (2007) conducted an RCT in which they found positive impacts of individualized assessment reports and feedback on student performance on the Ohio Graduation Test (OGT), which is administered in the 10th grade and again in the 11th and 12th grades for students who do not pass on their first attempt.

Breadth and Depth of Test Items. Related to the issue of subject area and domain alignment are the specific knowledge and skills assessed by the state test. Most state tests rely on 40 to 50 multiple-choice items (see Appendix A). With this in mind, a researcher considering a state test as an outcome measure for an RCT should appraise whether the items on the state test sufficiently capture the types of knowledge and skills that the intervention is designed to foster. This assessment can be based on the proportion of test items that measure skills and knowledge targeted by the intervention. Note, however, that this simple approach does not take into account the difficulty of the relevant test items.
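The simple proportion described here, and one way a difficulty adjustment could change the picture, can be sketched as follows. The item data and the particular weighting scheme (weighting each item by a difficulty value such as the share of students answering it incorrectly) are purely illustrative assumptions; the report does not prescribe any specific weighting method.

```python
# Hypothetical item data: (measures_targeted_skill, difficulty_weight).
# The weights are illustrative, e.g. the proportion of students who
# answered the item incorrectly on a prior administration.
items = [
    (True, 0.7), (True, 0.4), (False, 0.5), (True, 0.6),
    (False, 0.3), (False, 0.8), (True, 0.5), (False, 0.2),
]

# Simple alignment: proportion of items measuring targeted skills.
simple = sum(1 for aligned, _ in items if aligned) / len(items)

# Difficulty-weighted alignment: share of total item difficulty carried
# by aligned items, so harder aligned items count for more.
weighted = (sum(w for aligned, w in items if aligned)
            / sum(w for _, w in items))

print(round(simple, 2), round(weighted, 2))  # 0.5 0.55
```

In this hypothetical set, the aligned items are slightly harder than average, so the weighted index exceeds the simple proportion; with easier aligned items the adjustment would move in the other direction.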

Stakes of Testing. Researchers should also investigate the stakes or consequences attached to performance on the state assessments. Impact estimates based on low-stakes tests might be biased by motivational differences among students; those based on high-stakes tests (for example, state accountability tests) might be biased by cheating. Common sense suggests, and research confirms, that performance on low-stakes or no-stakes assessments tends to be lower than on high-stakes tests (Wise and DeMars 2005; Segal 2006). If an RCT evaluation includes an administration of its own assessment, a lack of incentives for students to perform well might lead to biased results if treatment and control students do not put forth the same level of effort on the test.5 This problem of differential motivation with low-stakes tests might be even more difficult to address than the problem of cheating associated with high-stakes tests (see Amrein and Berliner 2002). Although cheating might be kept to a minimum by implementing controlled testing procedures in high-stakes situations, ensuring that student motivation is consistently high in low-stakes situations might be impossible without external performance incentives (for example, rewarding students for high scores). Therefore, one potential advantage of using scores from a state assessment is increased validity due to the presence of high-stakes incentives and a tightly controlled testing environment. In addition, it is important that motivation and incentives, which are likely to vary across states and districts, be balanced across the treatment and control conditions, presumably through random assignment.

Participation Rates. Another important consideration for research teams evaluating whether to use state tests is participation rates. Fortunately, participation rates for state assessments are generally very high (Riddle 2005) and are almost always higher than the minimum participation rate set by RCT researchers. This is because federal and state accountability programs require very high participation rates in state testing programs (see Appendix A). Although there might be little incentive for students to participate in an assessment administered exclusively as part of a research study, participation in state assessments is typically mandatory. Furthermore, differential participation by the treatment and control groups might be more likely to occur when using an external assessment (for example, if control students become dispersed across many schools or districts and are difficult for the researchers to locate). This suggests that state assessments might be more attractive as outcome measures in RCTs simply because participation rates are so high.

Testing Accommodations and Exemptions. Researchers should pay attention to accommodation and related policies that might influence the quality of test data, participation rates, and/or test performance for the study population or important subgroups of interest (such as English language learner [ELL] students). For example, if an alternate version of the test is used or testing accommodations are common within the study population, the psychometric properties of the test used by the state might not meet the requirements of the RCT study. Furthermore, tests with and without accommodations and alternate versions of the tests could produce incomparable scores that cannot be analyzed without additional assumptions and adjustments to the statistical models. Perhaps the key issue for RCT evaluations is determining whether the prevalence of accommodations differs between the treatment and control groups. If accommodations are relatively rare, and no more common in one group than the other, they might not be much of a concern. In that case, data from students taking alternate forms might simply be excluded from the impact analyses.

Comparability of Test Scores Across Grades and States. A final consideration affecting the validity of state assessment scores as outcome measures in RCT evaluations is whether the study involves more than one grade and/or more than one state. In such cases, the work of defining the targeted outcomes is complicated by the fact that each test emphasizes different outcomes (reflecting differences in both grade level and overall state standards). In turn, this variation in outcome measures complicates the appraisal of the alignment between the tests and the intervention. For studies in which the research questions focus on specific skills targeted by the intervention, dissimilarities in the state tests raise serious concerns about the validity of using different tests, given that the objective is to estimate impacts on clearly defined outcomes. Alternatively, when the study involves an intervention intended to have effects on students' ability to meet state standards, variation in state standards and assessments might accurately represent the breadth of outcomes targeted by the intervention. We discuss issues related to multistate and/or multigrade RCTs in more detail later in this section.


4 More sophisticated methods for evaluating alignment between an assessment and an intervention come from research on the alignment of curriculum standards and state assessments. Such studies of alignment generally focus on content match, breadth of knowledge, balance across standards, cognitive demand (that is, challenge level), and the inclusion of irrelevant material (AERA 2003). Norman L. Webb (2007) has developed a process to match curriculum standards and assessments along four criteria related to the categorical concurrence, depth of knowledge, range of knowledge, and balance in content coverage. Porter and colleagues (2008) describe an alignment index that can be used to describe alignment not only between standards and tests, but also for textbooks and even classroom instruction, and that can also be used in analyses.
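As a rough sketch of the kind of index Porter and colleagues describe, the following computes one minus half the total absolute difference between two content-proportion distributions (for example, standards versus test items, tallied over topic-by-cognitive-demand cells). The cell structure and proportions below are hypothetical, and readers should consult Porter and colleagues (2008) for the exact formulation and its applications.

```python
def porter_alignment_index(p: list[float], q: list[float]) -> float:
    """Alignment index in the style of Porter and colleagues:
    1 - (sum of absolute differences between two content-proportion
    distributions) / 2. Yields 1.0 for identical distributions and
    0.0 for completely disjoint ones."""
    assert abs(sum(p) - 1.0) < 1e-9 and abs(sum(q) - 1.0) < 1e-9
    return 1.0 - sum(abs(a - b) for a, b in zip(p, q)) / 2.0

# Hypothetical content proportions across four topic-by-demand cells.
standards = [0.40, 0.30, 0.20, 0.10]
test      = [0.50, 0.30, 0.10, 0.10]
print(round(porter_alignment_index(standards, test), 2))  # 0.9
```

Because the index operates on any pair of content distributions, the same computation applies to standards versus textbooks or versus classroom instruction, which is what makes it usable across the comparisons the footnote lists.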
5 Although this type of differential bias in which the treatment and control group means are under- or overestimated by different degrees will translate into biased impact estimates, there might be instances in which measurement bias exists for both groups but does not translate into a biased impact estimate. For example, if the mean performance of both treatment and control groups is consistently underestimated due to lack of motivation to perform on a low-stakes test, the treatment impact estimate will be unbiased so long as the underestimation of the mean for both treatment and control groups is the same. However, the likelihood of differential bias between treatment and control groups makes this a tenuous assumption.
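The algebra behind this footnote can be illustrated with hypothetical numbers: the impact estimate is biased only by the difference between the two groups' measurement biases, so equal biases cancel while unequal ones do not.

```python
# Illustrative numbers (all hypothetical): true group means on the test,
# minus motivation-related measurement bias for treatment (T) and control (C).
true_T, true_C = 52.0, 50.0  # true impact = 2.0

# Case 1: both groups underperform by the same 3 points on a low-stakes
# test -> the biases cancel and the impact estimate is unbiased.
equal_bias_impact = (true_T - 3.0) - (true_C - 3.0)

# Case 2: differential bias (say, control students are less motivated)
# -> the impact estimate is off by the difference between the biases.
differential_bias_impact = (true_T - 1.0) - (true_C - 3.0)

print(equal_bias_impact)         # 2.0: equals the true impact
print(differential_bias_impact)  # 4.0: overstates the true impact by 2.0
```

The same arithmetic underlies the footnote's caution: assuming equal bias across groups is exactly the assumption that differential motivation makes tenuous.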