NCEE 2009-013, November 2009

Assessing the Reliability of State Assessments

As defined earlier, reliability is the degree to which an assessment provides scores that are sufficiently free of random measurement error that they can be used to detect program effects. One general mathematical representation of reliability is:

reliability = [var(Y) − var(ε)] / var(Y)

where var(Y) is the total variance in the outcome Y, and var(ε) is the error variability in Y.6 In other words, reliability is the proportion of variance in an outcome that is not measurement error.7
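The variance decomposition just described can be sketched in a few lines of code. The function name and the variance values below are hypothetical, chosen only to illustrate the ratio:

```python
# Illustrative sketch: reliability as the share of outcome variance
# that is not measurement error. Values are hypothetical.

def reliability(var_y: float, var_error: float) -> float:
    """Return (var(Y) - var(eps)) / var(Y), the classical reliability ratio."""
    if var_y <= 0:
        raise ValueError("reliability is undefined when total variance is zero")
    return (var_y - var_error) / var_y

# Example: total score variance of 100 with error variance of 20
# yields a reliability of 0.80.
print(reliability(100.0, 20.0))  # 0.8
```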

Reliability is important in the context of RCT studies because it influences the statistical power of treatment-control comparisons (Zimmerman and Williams 1986). For example, an experiment with 80 percent power to detect a 0.20 standardized mean difference in true scores is reduced to having power of 71 percent if the actual reliability of the outcome measure is 0.80 (that is, the uncorrected minimum detectable effect [MDE] is 0.20 while the MDE after adjusting for unreliability is 0.22).8
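The arithmetic behind this example can be reproduced with the standard two-sample normal-approximation formulas. This is a sketch, not output from any particular study; it assumes the attenuation model in footnote 8, under which the adjusted MDE equals the unadjusted MDE divided by the square root of the reliability:

```python
# Illustrative sketch: attenuating an experiment's minimum detectable
# effect (MDE) and power for outcome unreliability, using only the
# Python standard library. The numbers reproduce the example in the
# text (MDE = 0.20, reliability = 0.80).
import math
from statistics import NormalDist

ALPHA, PLANNED_POWER = 0.05, 0.80
z_alpha = NormalDist().inv_cdf(1 - ALPHA / 2)   # ~1.96
z_power = NormalDist().inv_cdf(PLANNED_POWER)   # ~0.84

mde = 0.20          # planned MDE in true-score standard deviation units
reliability = 0.80  # reliability of the outcome measure

# MDE after adjusting for unreliability (footnote 8):
mde_adjusted = mde / math.sqrt(reliability)

# Sample size per arm that gives 80 percent power for the unadjusted
# MDE, then the power actually achieved once the standardized effect
# is attenuated to mde * sqrt(reliability).
n_per_arm = 2 * (z_alpha + z_power) ** 2 / mde ** 2
attenuated = mde * math.sqrt(reliability)
noncentrality = attenuated * math.sqrt(n_per_arm / 2)
achieved_power = NormalDist().cdf(noncentrality - z_alpha)

print(round(mde_adjusted, 2))    # 0.22
print(round(achieved_power, 2))  # 0.71
```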

Nearly all standardized tests, including state assessments, have published estimates of test score reliability and/or standard errors of measurement. Both statistics indicate the precision of test scores (larger standard errors of measurement mean less precise scores), and both are usually found in the technical manuals published by the test developer or the state department of education.

What technical manuals do not usually report are conditional reliabilities or conditional standard errors of measurement, which show how the precision of test scores changes depending on the value of the score (Lord and Novick 1968; Hambleton and Swaminathan 1984). For most assessments, reliability is maximized near the average score on norm-referenced tests or near performance cut-scores on criterion-referenced tests, and it declines as scores move away from the average or the cut-scores (Hambleton, Swaminathan, and Rogers 1991). This means that the reliability of relatively high or relatively low scores can be much worse than the reliability of scores near the cut-points or the average score on a state assessment. When the performance of students is high enough or low enough to produce ceiling or floor effects,9 the reliabilities of assessments can be reduced dramatically. In fact, a test would have no reliability if every student in the sample got all items correct or incorrect,10 which also makes it impossible to detect treatment effects.
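A small simulation makes the ceiling argument concrete. The setup below is hypothetical (the true-score and error distributions, the ceiling value, and the sample size are all assumptions): two parallel forms of a test are built from the same true scores, and their correlation, a test-retest style reliability estimate, is computed with and without a score ceiling:

```python
# Illustrative simulation of a ceiling effect degrading reliability.
# All distributions and parameter values are assumptions for the sketch.
import math
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

n = 20000
true = [random.gauss(0, 1) for _ in range(n)]
# Error SD of 0.5 implies a nominal reliability of 1 / (1 + 0.25) = 0.8.
form_a = [t + random.gauss(0, 0.5) for t in true]
form_b = [t + random.gauss(0, 0.5) for t in true]

r_full = pearson(form_a, form_b)

# Impose a ceiling well below the top of the distribution: all scores
# above it are recorded at the maximum, hiding differences among
# high performers.
CEILING = 0.5
r_ceiling = pearson([min(a, CEILING) for a in form_a],
                    [min(b, CEILING) for b in form_b])

print(round(r_full, 2))     # close to the nominal 0.80
print(r_ceiling < r_full)   # True: the ceiling lowers reliability
```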

This decreased reliability of high and low test scores has important implications for evaluations in which the intervention is focused on relatively high- or low-performing students. In these cases, the state assessment might be an ineffective measure of achievement for the population of students targeted by the intervention. For high-performing students, the state test for their grade level may be too easy, whereas for low-performing students it may be too difficult. Again, this drives down the reliability of the state test scores and reduces the study's power to detect an effect of the program.

Because conditional reliabilities or conditional standard errors of measurement are rarely published, it can be difficult to ascertain whether a state assessment can be expected to produce reliable scores for a particular study population or subpopulation of interest. To make this determination, researchers should consult with the state office of assessment or the test developer to determine whether the proposed test is unusually hard or easy for the population of students targeted by the intervention.11

6 Classical true-score measurement theory states that observed scores comprise two components: (1) a true-score component, which reflects the true performance of the individual; and (2) a random measurement error component. Estimates of reliability seek to partition total variance in observed scores into true-score and error components.
7 Common techniques for estimating reliability include Cronbach's alpha, split-half, and test-retest reliability. Each technique uses correlations between items and/or overall scores to estimate the proportion of observed score variance that is not attributable to measurement error. Note that reliability requires both precision and variability in scores. Correlation-based measures are undefined, corresponding to a total absence of reliability, if every student receives the same score (that is, var(Y) = 0). Thus, a test must be developmentally appropriate for participating students in order for the data to be reliable.
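Of the estimators named in this footnote, Cronbach's alpha is the simplest to sketch: alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score). The item-response matrix below is invented for illustration:

```python
# Sketch of Cronbach's alpha from an item-by-student score matrix.
# The data are hypothetical.
from statistics import variance

def cronbach_alpha(items):
    """items: list of per-item score lists, one list per item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Three items answered by four students. Perfectly consistent items
# give alpha = 1; inconsistent items pull alpha toward zero.
consistent = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
print(round(cronbach_alpha(consistent), 6))  # 1.0
```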
8 Most power analyses for RCT studies do not explicitly account for unreliability in the outcome measure. Instead, the usual practice is to specify an MDE size relative to the total observed variance in an outcome measure. Assuming that the treatment has no effect on measurement error (which must be true if the measurement error is random noise), then the actual MDE of a study is equal to the unadjusted MDE divided by the square root of the reliability. It should also be noted that a few studies have argued that greater reliability can actually decrease power, but this is true only if there is a simultaneous increase in true score variability (Zimmerman and Williams 1986).
9 A ceiling effect occurs when many students get every item correct, so that it is difficult to make distinctions among high-performing students. Likewise, a floor effect occurs when many students get every item incorrect, making it difficult to distinguish among low-performing students.
10 In this situation, total variance would be zero and the general formula for reliability and correlation-based measures of reliability would be undefined, as discussed above.
11 If data are available prior to implementing the intervention, scatterplots of pretest data for the study population may be used to identify potential ceiling and floor effects. See Cronin (2005, p. 10) for an example of scatterplots showing ceiling and floor effects. If data are not available, a simulation study should be conducted based upon hypothetical distributions of scores and conditional reliabilities for the target population. Such a study would extend the typical Monte Carlo power simulation (May 2005; Muthén and Muthén 2002) to include (a) an error term whose variance increases as scores move away from the mean to represent variation in reliability, and (b) maximum and minimum values to represent ceiling and/or floor effects. Such a simulation would reveal power to detect effects given the conditional reliability of the test and the likely occurrence of ceiling or floor effects. Furthermore, grade equivalent scores or vertically scaled scores, if available, could be used to inform these analyses by pinpointing the likely performance of students in the target population (for example, students targeted by the intervention might be those who score between one and two grade levels below their current grade).
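The extended Monte Carlo simulation this footnote describes can be sketched as follows. Every parameter value here (the effect size, sample sizes, the error model in which variance grows with distance from the mean, and the ceiling) is an illustrative assumption, not a value taken from any real assessment or study:

```python
# Hypothetical sketch of a power simulation with (a) measurement error
# whose variance increases away from the mean and (b) a score ceiling,
# per the extensions described in footnote 11. All parameters are
# assumptions for illustration.
import math
import random

random.seed(1)

def simulate_power(effect, n_per_arm, reps, ceiling=None):
    """Share of replications in which a z-test detects the treatment effect."""
    def observe(shift):
        true = random.gauss(shift, 1)
        err_sd = 0.3 + 0.3 * abs(true)  # (a) error grows away from the mean
        score = true + random.gauss(0, err_sd)
        return min(score, ceiling) if ceiling is not None else score  # (b) ceiling

    hits = 0
    for _ in range(reps):
        treat = [observe(effect) for _ in range(n_per_arm)]
        control = [observe(0.0) for _ in range(n_per_arm)]
        mt = sum(treat) / n_per_arm
        mc = sum(control) / n_per_arm
        vt = sum((y - mt) ** 2 for y in treat) / (n_per_arm - 1)
        vc = sum((y - mc) ** 2 for y in control) / (n_per_arm - 1)
        z = (mt - mc) / math.sqrt(vt / n_per_arm + vc / n_per_arm)
        hits += z > 1.96
    return hits / reps

power_open = simulate_power(0.25, 200, 300)
power_capped = simulate_power(0.25, 200, 300, ceiling=0.5)
print(power_capped < power_open)  # True: a low ceiling costs power
```

Comparing the two runs shows the power loss attributable to the ceiling alone, which is the kind of evidence the simulation described above would supply before committing to a state assessment as the outcome measure.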