- Introduction
- Whether to use State Tests in Education Experiments
- Assessing the Validity of State Assessments for Evaluation Purposes
- Assessing the Reliability of State Assessments
- Assessing the Feasibility of Collecting State Test Data

- How to use State Test Data in Education Experiments
- Conclusions and Recommendations
- References
- Appendix A: State Testing Programs Under NCLB
- Appendix B: How NCEE-Funded Evaluations use State Test Data
- List of Tables

As defined earlier, reliability is the degree to which an assessment provides scores that are sufficiently free of random measurement error that they can be used to detect program effects. One general mathematical representation of reliability is:

reliability = [var(Y) - var(ε)] / var(Y) = 1 - var(ε) / var(Y)

where var(Y) is the total variance in the outcome Y, and var(ε) is the error
variance in Y.^{6} In other words, reliability
is the proportion of variance in an outcome that is not measurement error.^{7}
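This definition can be checked with a minimal calculation; the variance components below are hypothetical numbers chosen for illustration:

```python
def reliability(total_var: float, error_var: float) -> float:
    """Reliability = proportion of outcome variance that is not measurement error."""
    return (total_var - error_var) / total_var

# Hypothetical variance components: var(Y) = 100, var(error) = 20
print(reliability(100.0, 20.0))  # 0.8
```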

Reliability is important in the context of RCT studies because it influences the
statistical power of treatment-control comparisons (Zimmerman and Williams 1986).
For example, an experiment with 80 percent power to detect a 0.20 standardized mean
difference in true scores is reduced to having power of 71 percent if the actual
reliability of the outcome measure is 0.80 (that is, the uncorrected minimum detectable
effect [MDE] is 0.20 while the MDE after adjusting for unreliability is 0.22).^{8}
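The figures in this example can be reproduced with a short calculation. The sketch below uses the standard normal approximation for a two-sided test at the 5 percent level; the 0.20 MDE and 0.80 reliability come from the text, and everything else is conventional power arithmetic:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
z_alpha = nd.inv_cdf(0.975)  # critical value for a two-sided test at alpha = 0.05
z_power = nd.inv_cdf(0.80)   # z-score corresponding to 80 percent target power

reliability = 0.80
mde_true = 0.20  # minimum detectable effect in true-score units

# Unreliability attenuates the observed standardized effect by sqrt(reliability),
# so a design powered at 80 percent for a true effect of 0.20 actually has
# power corresponding to an observed effect of 0.20 * sqrt(0.80).
noncentrality = z_alpha + z_power  # ~2.80 for the original design
power = nd.cdf(noncentrality * sqrt(reliability) - z_alpha)

# Equivalently, the MDE after adjusting for unreliability grows by 1/sqrt(reliability).
mde_adjusted = mde_true / sqrt(reliability)

print(round(power, 2))         # 0.71
print(round(mde_adjusted, 2))  # 0.22
```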

Nearly all standardized tests, including state assessments, have published estimates of test score reliability, standard errors of measurement, or both. Both statistics index the precision of test scores (the larger the standard error of measurement, the lower the precision), and they are usually reported in the technical manuals published by the test developer or the state department of education.

What is not usually reported in technical manuals are conditional reliabilities
or conditional standard errors of measurement, which show how the precision of test
scores changes depending on the value of the score (Lord and Novick 1968; Hambleton
and Swaminathan 1984). For most assessments, reliability is maximized near the average
score on norm-referenced tests or near performance cut-scores on criterion-referenced
tests, with a downward curve as scores move away from the average or cut-scores
(Hambleton, Swaminathan, and Rogers 1991). This suggests that the reliability of
relatively high or relatively low scores can be much worse than the reliability
of scores near the cut-points or the average score on a state assessment. When the
performance of students is high enough or low enough to produce ceiling or floor
effects,^{9} the reliabilities of assessments
can be reduced dramatically. In the extreme, a test would have no reliability at all
if every student in the sample answered every item correctly (or every item incorrectly),
because scores with no variance cannot detect treatment effects.^{10}
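One way to see how precision falls off away from the middle of the score range is to compute the conditional standard error of measurement under a simple item response model. The sketch below assumes a Rasch (one-parameter logistic) model and a hypothetical 41-item test with difficulties spread evenly between -2 and +2 logits; the conditional SEM is smallest where the items are concentrated and grows for abilities far above or below that range:

```python
from math import exp, sqrt

def p_correct(theta, b):
    """Rasch model: probability of a correct response at ability theta, item difficulty b."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def conditional_sem(theta, difficulties):
    """Conditional SEM = 1 / sqrt(test information) at ability theta.

    Under the Rasch model, each item contributes p * (1 - p) to test information.
    """
    info = sum(p * (1.0 - p)
               for p in (p_correct(theta, b) for b in difficulties))
    return 1.0 / sqrt(info)

# Hypothetical test: 41 items with difficulties evenly spaced from -2 to +2 logits
difficulties = [-2.0 + 0.1 * i for i in range(41)]

for theta in (-3.0, 0.0, 3.0):
    print(f"theta={theta:+.1f}  SEM={conditional_sem(theta, difficulties):.2f}")
```

Running this shows the SEM rising (that is, reliability falling) for extreme abilities, which is the pattern behind the ceiling- and floor-effect concern above.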

This decreased reliability of high and low test scores has important implications
for evaluations in which the intervention is focused on relatively high- or low-performing
students. In these cases, the state assessment might be an ineffective measure of
achievement for the population of students targeted by the intervention. For high-performing
students, the state test *for their grade level* may be too easy, whereas
for low-performing students it may be too difficult. Again, this drives down the
reliability of the state test scores and reduces the study's power to detect an
effect of the program.

Because conditional reliabilities or conditional standard errors of measurement
are rarely published, it can be difficult to ascertain whether a state assessment
can be expected to produce reliable scores for a particular study population or
subpopulation of interest. To make this determination, researchers should consult
with the state office of assessment or the test developer to determine whether the
proposed test is unusually hard or easy for the population of students targeted
by the intervention.^{11}