As defined earlier, reliability is the degree to which an assessment provides scores that are sufficiently free of random measurement error to be used to detect program effects. One general mathematical representation of reliability is:

reliability = [var(Y) − var(ε)] / var(Y) = 1 − var(ε) / var(Y)

where var(Y) is the total variance in the outcome Y, and var(ε) is the error variance in Y. In other words, reliability is the proportion of variance in an outcome that is not measurement error.
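This definition can be illustrated with a small simulation: generate observed scores as true scores plus independent error, then estimate reliability as the share of total variance that is not error variance. The variance values used below (0.8 for true scores, 0.2 for error) are illustrative choices, not values from the text.

```python
import random
import statistics

# Illustrative simulation: observed score Y = true score T + error e.
# var(T) = 0.8 and var(e) = 0.2 are arbitrary, so the theoretical
# reliability is var(T) / [var(T) + var(e)] = 0.80.
random.seed(12345)
n = 200_000
true_scores = [random.gauss(0.0, 0.8 ** 0.5) for _ in range(n)]
errors = [random.gauss(0.0, 0.2 ** 0.5) for _ in range(n)]
observed = [t + e for t, e in zip(true_scores, errors)]

var_y = statistics.variance(observed)  # total variance, var(Y)
var_e = statistics.variance(errors)    # error variance, var(e)
reliability = 1.0 - var_e / var_y      # proportion of variance that is not error

print(f"estimated reliability: {reliability:.3f}")  # close to 0.80
```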
Reliability is important in the context of RCT studies because it influences the statistical power of treatment-control comparisons (Zimmerman and Williams 1986). For example, an experiment with 80 percent power to detect a 0.20 standardized mean difference in true scores has only 71 percent power if the reliability of the outcome measure is 0.80 (that is, the uncorrected minimum detectable effect [MDE] is 0.20, while the MDE after adjusting for unreliability is 0.20/√0.80 ≈ 0.22).
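The power figures in this example can be reproduced with a normal-approximation calculation: find the standard error implied by a design with 80 percent power at an effect of 0.20, attenuate the effect by the square root of reliability, and recompute power. The two-sided 5 percent significance level and the normal approximation are assumptions made for this sketch; the quantiles are hard-coded standard normal values.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Standard normal quantiles for a two-sided 5% test and 80% power.
z_alpha = 1.959964  # Phi^{-1}(0.975)
z_beta = 0.841621   # Phi^{-1}(0.80)

true_effect = 0.20   # standardized mean difference in true scores
reliability = 0.80   # reliability of the outcome measure

# Standard error implied by a design with 80% power at an effect of 0.20.
se = true_effect / (z_alpha + z_beta)

# Unreliability attenuates the observed effect by sqrt(reliability);
# the power calculation ignores the negligible opposite-tail rejection
# probability.
observed_effect = true_effect * math.sqrt(reliability)
power = norm_cdf(observed_effect / se - z_alpha)

# Equivalently, unreliability inflates the minimum detectable effect.
mde_adjusted = true_effect / math.sqrt(reliability)

print(f"power with reliability 0.80: {power:.2f}")            # about 0.71
print(f"MDE adjusted for unreliability: {mde_adjusted:.2f}")  # about 0.22
```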
Nearly all standardized tests, including state assessments, have published estimates of test score reliability, standard errors of measurement, or both. Both statistics are indicators of the precision of test scores (larger standard errors indicate less precise scores) and are usually found in the technical manuals published by the test developer or the state department of education.
What is not usually reported in technical manuals are conditional reliabilities or conditional standard errors of measurement, which show how the precision of test scores changes depending on the value of the score (Lord and Novick 1968; Hambleton and Swaminathan 1984). For most assessments, reliability is maximized near the average score on norm-referenced tests or near performance cut-scores on criterion-referenced tests, and it declines as scores move away from the average or the cut-scores (Hambleton, Swaminathan, and Rogers 1991). This means that the reliability of relatively high or relatively low scores can be much worse than the reliability of scores near the cut-scores or the average score on a state assessment. When the performance of students is high enough or low enough to produce ceiling or floor effects, the reliabilities of assessments can be reduced dramatically. In fact, a test would have no reliability if every student in the sample got all items correct or incorrect, which also makes it impossible to detect treatment effects.
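The pattern described here can be sketched using item response theory, where the conditional standard error of measurement is the inverse square root of the test information function. The item parameters below are hypothetical, chosen only to illustrate a short test whose item difficulties cluster near the average ability of zero.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability of a correct answer
    for a student of ability theta on an item with discrimination a
    and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def conditional_sem(theta: float, items: list[tuple[float, float]]) -> float:
    """Conditional SEM = 1 / sqrt(test information) at ability theta."""
    info = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        info += a * a * p * (1.0 - p)  # Fisher information for a 2PL item
    return 1.0 / math.sqrt(info)

# Hypothetical 20-item test with difficulties clustered near theta = 0.
items = [(1.0, b / 10.0) for b in range(-10, 10)]

# SEM grows (precision falls) as ability moves away from the region
# where item difficulties are concentrated.
for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"theta = {theta:+.1f}  SEM = {conditional_sem(theta, items):.2f}")
```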
This decreased reliability of high and low test scores has important implications for evaluations in which the intervention is focused on relatively high- or low-performing students. In these cases, the state assessment might be an ineffective measure of achievement for the population of students targeted by the intervention. For high-performing students, the state test for their grade level may be too easy, whereas for low-performing students it may be too difficult. Again, this drives down the reliability of the state test scores and reduces the study's power to detect an effect of the program.
Because conditional reliabilities or conditional standard errors of measurement are rarely published, it can be difficult to ascertain whether a state assessment can be expected to produce reliable scores for a particular study population or subpopulation of interest. To make this determination, researchers should consult with the state office of assessment or the test developer to determine whether the proposed test is unusually hard or easy for the population of students targeted by the intervention.