Most state tests produce at least two types of scores: scale scores and proficiency levels. The primary distinction between them is that scale scores are measured on a continuous scale while proficiency level scores are measured on an ordinal scale. As outcome measures in an RCT, both types of score have advantages and disadvantages.
The primary advantage of scale scores is that they provide greater precision (that is, the ability to distinguish the relative performance of students at the high and low ends of the same proficiency level), which translates to greater statistical power to detect program effects. Although proficiency level scores yield lower statistical power, they do support a more intuitive description of program effects. This is because proficiency levels are not just categorized continuous scores, but rather judgments about what cutoff points indicate substantively meaningful attainment of different levels of proficiency. For example, results from a logistic regression analysis of proficiency level scores might be interpreted as showing that "students who participated in the intervention were two times more likely to score proficient or above on the state test." Arguably, such descriptions of the effects of an intervention might be more easily understood than a mean difference in scale scores or a standardized effect size.
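To make the contrast between the two kinds of summary concrete, the sketch below computes both from made-up numbers: an odds ratio for scoring proficient or above (the quantity a logistic regression with only a treatment indicator would estimate) and a standardized mean difference in scale scores. The function names, counts, and scores are hypothetical and are not drawn from any study; an actual RCT analysis would typically also adjust for covariates and clustering.

```python
from math import sqrt
from statistics import mean, stdev


def odds_ratio(treat_prof, treat_n, ctrl_prof, ctrl_n):
    """Odds of scoring proficient or above, treatment relative to control."""
    treat_odds = treat_prof / (treat_n - treat_prof)
    ctrl_odds = ctrl_prof / (ctrl_n - ctrl_prof)
    return treat_odds / ctrl_odds


def standardized_mean_difference(treat_scores, ctrl_scores):
    """Mean difference in scale scores divided by the pooled standard deviation."""
    n_t, n_c = len(treat_scores), len(ctrl_scores)
    pooled_var = (
        (n_t - 1) * stdev(treat_scores) ** 2
        + (n_c - 1) * stdev(ctrl_scores) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treat_scores) - mean(ctrl_scores)) / sqrt(pooled_var)


# Hypothetical counts: 60 of 100 treatment and 40 of 100 control students proficient.
print(odds_ratio(60, 100, 40, 100))

# Hypothetical scale scores for five students per group.
treat = [310, 325, 340, 355, 370]
ctrl = [300, 315, 330, 345, 360]
print(standardized_mean_difference(treat, ctrl))
```

Note that an odds ratio is a ratio of odds rather than of proportions: in this made-up example the proportions proficient are 0.60 and 0.40, while the odds are 1.5 and about 0.67.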
It is important to note, however, that each state defines proficiency differently, because both test content and proficiency cut scores vary across states (Porter, Polikoff, and Smithson 2008; Petrilli 2008; NCES 2007). This complicates analyses when data come from more than one grade or from multiple states. It could be argued, however, that effects on proficiency rates are still worth measuring across grades and states because proficiency rates are a key focus of federal, state, and district policy. Caution must nevertheless be exercised when interpreting these kinds of results. We discuss this issue further in the section entitled Combining Results Across Tests for Different Grades and/or States.
Some state tests produce additional scores such as normal curve equivalent (NCE) scores, z-scores, T scores, and percentile ranks (see Allen and Yen 1979 for a discussion of common measurement scales). Z-scores and T scores are simply linear transformations of the scale scores (that is, with a different mean and standard deviation); therefore, the same issues and methods for scale scores apply to these scores. NCE scores are a non-linear transformation of scale scores that ensures the scores follow a normal distribution with a mean of 50 and a standard deviation of 21.06. A potential advantage of using one of these popular rescaled scores is that it might serve to place different tests on a common scale, so long as the tests measure the same or similar knowledge and skills. For example, an NCE of 50 on any test corresponds to the average score for the norming sample for that test. Even if states do not provide these additional scores, researchers can typically convert scale scores to T or z-scores, as discussed later in this section. Analyzing scores from different tests on a common scale makes it possible to combine results across different grades and even different states under certain assumptions. We discuss these assumptions and additional considerations in combining results across states or grades in the next section.
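As a minimal sketch of the linear conversions mentioned above, the code below rescales scale scores to z-scores (mean 0, standard deviation 1) and T scores (mean 50, standard deviation 10). It assumes, purely for illustration, that scores are standardized against the mean and standard deviation of the scores in hand; a study might instead use a state's published mean and standard deviation for the relevant grade. The function names and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def to_z_scores(scale_scores):
    """Linearly rescale scores to mean 0 and standard deviation 1.

    Assumption: the reference mean and SD are those of the supplied scores.
    """
    m = mean(scale_scores)
    s = pstdev(scale_scores)
    return [(x - m) / s for x in scale_scores]


def to_t_scores(scale_scores):
    """Linearly rescale scores to mean 50 and standard deviation 10."""
    return [50 + 10 * z for z in to_z_scores(scale_scores)]


# Hypothetical scale scores from a single test and grade.
scores = [300, 320, 340, 360, 380]
z = to_z_scores(scores)
t = to_t_scores(scores)
print(z)
print(t)
```

Because both conversions are linear, they preserve the relative spacing of students on the original scale; only the units change.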
Notably, using percentile ranks to estimate treatment effects is usually not advisable, because these scores are on a cumulative scale such that the absolute size of a 10-point difference in percentile rank depends on its location on the scale (for example, moving from the 70th to the 80th percentile represents a larger shift in underlying ability than moving from the 50th to the 60th percentile). When percentile ranks are available, it is possible to convert these scores to z-scores, T scores, or NCEs based on the quantiles of the normal distribution (see Allen and Yen 1979).
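The conversion from percentile ranks can be sketched with the inverse normal distribution function in Python's standard library. The function names are hypothetical, and the sketch assumes the percentile ranks lie strictly between 0 and 100. The NCE standard deviation of 21.06 is the conventional value, chosen so that NCEs of 1, 50, and 99 coincide with the corresponding percentile ranks.

```python
from statistics import NormalDist

NCE_SD = 21.06  # conventional NCE standard deviation
_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def percentile_to_z(pr):
    """Map a percentile rank (0 < pr < 100) to a z-score via normal quantiles."""
    if not 0 < pr < 100:
        raise ValueError("percentile rank must be strictly between 0 and 100")
    return _STD_NORMAL.inv_cdf(pr / 100)


def percentile_to_t(pr):
    """Map a percentile rank to a T score (mean 50, SD 10)."""
    return 50 + 10 * percentile_to_z(pr)


def percentile_to_nce(pr):
    """Map a percentile rank to a normal curve equivalent (mean 50, SD 21.06)."""
    return 50 + NCE_SD * percentile_to_z(pr)


# The same 10-point gap in percentile rank spans different distances in z units.
gap_middle = percentile_to_z(60) - percentile_to_z(50)
gap_upper = percentile_to_z(80) - percentile_to_z(70)
print(gap_middle, gap_upper)
```

Comparing `gap_middle` with `gap_upper` illustrates why equal differences in percentile rank do not represent equal differences on the underlying scale.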