Technical Methods Report: Using State Tests in Education Experiments - Analysis of Scale, Proficiency Level, or Other Test Scores

Technical Methods Report: Using State Tests in Education Experiments
A Discussion of the Issues

NCEE 2009-013
November 2009

Introduction
Whether to use State Tests in Education Experiments
How to use State Test Data in Education Experiments
- Whether to Secure Baseline Data
- How to Use Baseline Measures
- Analysis of Scale, Proficiency Level, or Other Test Scores
- Combining Results Across Tests for Different Grades or States
Conclusions and Recommendations
References
Appendix A: State Testing Programs Under NCLB
Appendix B: How NCEE-Funded Evaluations use State Test Data
List of Tables

Analysis of Scale, Proficiency Level, or Other Test Scores

Most state tests produce at least two types of scores: scale scores and proficiency levels. The primary distinction between them is that scale scores are measured on a continuous scale²⁰ while proficiency level scores are measured on an ordinal scale. When considering their use as an outcome measure in an RCT, both scores have advantages and disadvantages.

The primary advantage of scale scores is that they provide greater precision (that is, the ability to distinguish the relative performance of students at the high and low ends of the same proficiency level), which translates to greater statistical power to detect program effects. Although proficiency level scores yield lower statistical power, they do support a more intuitive description of program effects. This is because proficiency levels are not just categorized continuous scores, but rather judgments about what cutoff points indicate substantively meaningful attainment of different levels of proficiency. For example, results from a logistic regression analysis of proficiency level scores might be interpreted as showing that "students who participated in the intervention were two times more likely to score proficient or above on the state test." Arguably, such descriptions of the effects of an intervention might be more easily understood than a mean difference in scale scores or a standardized effect size.²¹
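The trade-off described above can be illustrated with a minimal sketch. The cut scores, level labels, and student scores below are hypothetical and are not taken from any state test. The sketch shows that students at opposite ends of one proficiency level become indistinguishable once scale scores are categorized, and how a simple odds ratio for scoring proficient or above might be computed from raw counts (a logistic regression with a treatment indicator as its only predictor would yield the same odds ratio).

```python
from bisect import bisect_right

# Hypothetical cut scores: the lowest scale score for "basic",
# "proficient", and "advanced". Real cut scores vary by state and grade.
CUTS = [300, 350, 400]
LEVELS = ["below basic", "basic", "proficient", "advanced"]


def proficiency_level(scale_score):
    """Map a scale score to an ordinal proficiency level."""
    # bisect_right counts how many cut scores are <= scale_score,
    # so a score exactly at a cut falls in the higher level.
    return LEVELS[bisect_right(CUTS, scale_score)]


def odds_ratio(treat_scores, control_scores, cut=350):
    """Odds of scoring at or above `cut` in treatment relative to control.

    Assumes each group has at least one student below the cut;
    otherwise the odds are undefined (division by zero).
    """
    def odds(scores):
        above = sum(s >= cut for s in scores)
        below = len(scores) - above
        return above / below

    return odds(treat_scores) / odds(control_scores)


# Two students 48 scale points apart receive the same proficiency level.
print(proficiency_level(351), proficiency_level(399))

# Hypothetical scale scores for a small treatment and control group.
treatment = [310, 340, 355, 360, 372, 390, 405, 420]
control = [295, 320, 335, 345, 348, 352, 365, 410]
print(odds_ratio(treatment, control))
```

Note that an odds ratio is not the same quantity as a ratio of probabilities, so care is needed when translating it into a statement about how much "more likely" students are to reach proficiency.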

It is important to note, however, that each state defines proficiency differently, because both the content of the tests and the proficiency cut scores vary across states (Porter, Polikoff, and Smithson 2008; Petrilli 2008; NCES 2007). This complicates analyses when data come from more than one grade or from multiple states. It could be argued, however, that effects on proficiency rates are still worth measuring across grades and states because proficiency rates are a key focus of federal, state, and district policy. Caution must nevertheless be exercised when interpreting these kinds of results. We discuss this issue further in the section entitled Combining Results Across Tests for Different Grades and/or States.

Some state tests produce additional scores such as normal curve equivalent (NCE) scores, z-scores, T scores, and percentile ranks (see Allen and Yen (1979) for a discussion of common measurement scales). Both z-scores and T scores are simply linear transformations of the scale scores (that is, the same scores expressed with a different mean and standard deviation), so the same issues and methods that apply to scale scores also apply to them. NCE scores are a non-linear transformation of scale scores that ensures the scores follow a normal distribution with a mean of 50 and a standard deviation of 21.06. A potential advantage of using one of these popular rescaled scores is that it might serve to place different tests on a common scale, so long as the tests measure the same or similar knowledge and skills. For example, an NCE of 50 on any test corresponds to the average score for the norming sample for that test.²² Even if states do not provide these additional scores, researchers can typically convert scale scores to T or z-scores, as discussed later in this section. Analyzing scores from different tests on a common scale makes it possible to combine results across different grades and even different states under certain assumptions. We discuss these assumptions and additional considerations in combining results across states or grades in the next section.
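As a minimal sketch of the linear transformations mentioned above, the following converts scale scores to z-scores and T scores. It assumes the reference mean and standard deviation are computed from the scores supplied; in practice a researcher might prefer the state population or norming-sample values where those are published. The function names and the example scores are illustrative only.

```python
from statistics import mean, pstdev


def to_z_scores(scale_scores, ref_mean=None, ref_sd=None):
    """Linearly rescale scores to mean 0 and standard deviation 1.

    If no reference mean/SD is given, they are computed from the
    supplied scores (an assumption of this sketch). The SD must be
    non-zero.
    """
    m = mean(scale_scores) if ref_mean is None else ref_mean
    sd = pstdev(scale_scores) if ref_sd is None else ref_sd
    return [(s - m) / sd for s in scale_scores]


def to_t_scores(scale_scores, ref_mean=None, ref_sd=None):
    """Linearly rescale scores to the T metric (mean 50, SD 10)."""
    return [50 + 10 * z for z in to_z_scores(scale_scores, ref_mean, ref_sd)]


# Hypothetical scale scores from a single grade and state.
scores = [300, 350, 400]
z = to_z_scores(scores)
t = to_t_scores(scores)
print(z)
print(t)
```

Because both transformations are linear, they preserve the relative spacing of students' scale scores; only the units change.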

Notably, using percentile ranks to estimate treatment effects is usually not advisable, because these scores are on a cumulative scale such that the absolute size of a 10-point difference in percentile rank depends on its location on the scale (for example, moving from the 70th to the 80th percentile represents a larger shift in underlying ability than moving from the 50th to the 60th percentile). When percentile ranks are available, it is possible to convert these scores to z-scores, T scores, or NCEs based on the quantiles of the normal distribution (see Allen and Yen 1979).
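The conversion from percentile ranks can be sketched with the inverse of the standard normal cumulative distribution function. This is a minimal illustration that assumes the underlying scores are approximately normally distributed; the function names are hypothetical. It also shows why equal percentile-rank differences do not correspond to equal differences on the z-score scale.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percentile_to_z(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to a z-score."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile rank must be strictly between 0 and 100")
    return STD_NORMAL.inv_cdf(percentile_rank / 100)


def percentile_to_t(percentile_rank):
    """Convert a percentile rank to a T score (mean 50, SD 10)."""
    return 50 + 10 * percentile_to_z(percentile_rank)


def percentile_to_nce(percentile_rank):
    """Convert a percentile rank to an NCE (mean 50, SD 21.06)."""
    return 50 + 21.06 * percentile_to_z(percentile_rank)


# The same 10-point percentile difference spans different z-score distances.
gap_middle = percentile_to_z(60) - percentile_to_z(50)
gap_upper = percentile_to_z(80) - percentile_to_z(70)
print(round(gap_middle, 3), round(gap_upper, 3))
```

Under the normality assumption, the gap between the 70th and 80th percentiles is larger in z-score units than the gap between the 50th and 60th, consistent with the example in the text.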


²⁰ Some measurement experts might be more specific and claim that scale scores are usually measured on an interval scale, which suggests that the intervals between scores are equivalent throughout the full range of scores. In other words, a difference of one point reflects the same degree of difference in knowledge or skills regardless of whether the difference is observed at the low end or the high end of a scale. In truth, scale scores are interval scaled in theory and might not actually be perfectly interval scaled in practice.
²¹ Because proficiency level scores are simply a categorization of the scale scores, one analysis option involves a staged analysis in which the first stage of analyses uses scale scores. Then, if a significant program effect is revealed, the second stage of analyses uses proficiency level scores in order to improve interpretability of the results.
²² A norming sample is the sample of students from the tested population who were included in the original calibration and scaling of the test. For state assessments, the norming sample is representative of the population of students in the state for whom that version of the test was written.