- Introduction
- Whether to use State Tests in Education Experiments
- How to use State Test Data in Education Experiments
- Whether to Secure Baseline Data
- How to Use Baseline Measures
- Analysis of Scale, Proficiency Level, or Other Test Scores
- Combining Results Across Tests for Different Grades or States

- Conclusions and Recommendations
- References
- Appendix A: State Testing Programs Under NCLB
- Appendix B: How NCEE-Funded Evaluations use State Test Data
- List of Tables

Most state tests produce at least two types of scores: scale scores and proficiency
levels. The primary distinction between them is that scale scores are measured on
a continuous scale^{20} while proficiency level scores are measured on an ordinal scale.
When considering their use as an outcome measure in an RCT, both scores have advantages
and disadvantages.

The primary advantage of scale scores is that they provide greater precision (that
is, the ability to distinguish the relative performance of students at the high
and low ends of the same proficiency level), which translates to greater statistical
power to detect program effects. Although proficiency level scores yield lower statistical
power, they do support a more intuitive description of program effects. This is
because proficiency levels are not merely continuous scores grouped into categories;
they reflect judgments about which cut points indicate substantively meaningful attainment
of different levels of proficiency. For example, results from a logistic regression
analysis of proficiency level scores might be interpreted as showing that "students
who participated in the intervention were two times more likely to score proficient
or above on the state test." Arguably, such descriptions of the effects of an intervention
might be more easily understood than a mean difference in scale scores or a standardized
effect size.^{21}
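
As a rough illustration of where a statement like "two times more likely" can come from, the sketch below computes an odds ratio from proficiency counts. The counts and function names are hypothetical and are not drawn from any study. When a treatment indicator is the only predictor, a logistic regression of proficiency status recovers this same odds ratio. The sketch also computes the ratio of proficiency rates, which is generally a different number, so the wording of such summaries deserves some care.

```python
def odds_ratio(prof_t, n_t, prof_c, n_c):
    """Odds of scoring proficient in the treatment group divided by
    the odds of scoring proficient in the control group."""
    odds_t = prof_t / (n_t - prof_t)
    odds_c = prof_c / (n_c - prof_c)
    return odds_t / odds_c


def rate_ratio(prof_t, n_t, prof_c, n_c):
    """Proficiency rate in the treatment group divided by the
    proficiency rate in the control group."""
    return (prof_t / n_t) / (prof_c / n_c)


# Hypothetical counts (assumed for illustration only):
# 120 of 200 treatment students and 90 of 210 control students
# score proficient or above.
print(f"odds ratio: {odds_ratio(120, 200, 90, 210):.2f}")
print(f"rate ratio: {rate_ratio(120, 200, 90, 210):.2f}")
```

With these assumed counts the odds of scoring proficient are twice as high in the treatment group, while the proficiency rate itself is 60 percent versus roughly 43 percent.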

It is important to note, however, that each state defines proficiency differently,
because both the content of the tests and the proficiency cut scores vary across states
(Porter, Polikoff, and Smithson 2008; Petrilli 2008; NCES 2007). This complicates
analyses when data come from more than one grade or from multiple states. It could
be argued, however, that effects on proficiency rates are still worth measuring
across grades and states because proficiency rates are a key focus of federal, state,
and district policy. Caution must nevertheless be exercised when interpreting these
kinds of results. We discuss this issue further in the section entitled *Combining
Results Across Tests for Different Grades and/or States.*

Some state tests produce additional scores such as normal curve equivalent (NCE)
scores, z-scores, T scores, and percentile ranks (see Allen and Yen 1979 for a
discussion of common measurement scales). Z-scores and T scores are simply linear
transformations of the scale scores (that is, with a different mean and standard
deviation); therefore, the same issues and methods for scale scores apply to these
scores. NCE scores are a non-linear transformation of scale scores that ensures
the scores follow a normal distribution with a mean of 50 and a standard deviation
of approximately 21.06. A potential advantage of using one of these popular rescaled scores is that
it might serve to place different tests on a common scale, so long as the tests
measure the same or similar knowledge and skills. For example, an NCE of 50 on any
test corresponds to the average score for the norming sample for that test.^{22} Even
if states do not provide these additional scores, researchers can typically convert
scale scores to T or z-scores, as discussed later in this section. Analyzing scores
from different tests on a common scale makes it possible to combine results across
different grades and even different states under certain assumptions. We discuss
these assumptions and additional considerations in combining results across states
or grades in the next section.
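
The linear rescaling described above can be sketched as follows. This is a minimal illustration, not the report's procedure: it assumes a reference mean and standard deviation are available for each test (for example, statewide values for a given grade and year), and the function names and score values are hypothetical. It uses the conventional T-score scale with a mean of 50 and a standard deviation of 10.

```python
def to_z_scores(scale_scores, ref_mean, ref_sd):
    """Convert scale scores to z-scores using a reference mean and SD
    (assumed here to come from the test's reference population)."""
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    return [(score - ref_mean) / ref_sd for score in scale_scores]


def to_t_scores(z_scores):
    """Convert z-scores to T scores (mean 50, standard deviation 10)."""
    return [50 + 10 * z for z in z_scores]


# Hypothetical scale scores and reference values, for illustration only.
scores = [300, 320, 340, 360, 380]
z = to_z_scores(scores, ref_mean=340, ref_sd=40)
t = to_t_scores(z)
print(z)
print(t)
```

Because both steps are linear, the ordering of students and the relative distances between their scores are unchanged; only the mean and standard deviation of the scale differ.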

Notably, using percentile ranks to estimate treatment effects is usually not advisable, because these scores are on a cumulative scale such that the absolute size of a 10-point difference in percentile rank depends on its location on the scale (for example, moving from the 70th to the 80th percentile represents a larger shift in underlying ability than moving from the 50th to the 60th percentile). When percentile ranks are available, it is possible to convert these scores to z-scores, T scores, or NCEs based on the quantiles of the normal distribution (see Allen and Yen 1979).
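
A minimal sketch of this conversion, assuming the percentile ranks refer to a normally distributed reference population, uses the quantile function of the standard normal distribution from Python's standard library. The function names are hypothetical, and the NCE multiplier of 21.06 is the conventional standard deviation of that scale.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percentile_to_z(percentile_rank):
    """Map a percentile rank (strictly between 0 and 100) to the
    corresponding quantile of the standard normal distribution."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile_rank must be strictly between 0 and 100")
    return STANDARD_NORMAL.inv_cdf(percentile_rank / 100)


def z_to_t(z):
    """T score: mean 50, standard deviation 10."""
    return 50 + 10 * z


def z_to_nce(z):
    """Normal curve equivalent: mean 50, standard deviation 21.06."""
    return 50 + 21.06 * z


# The same 10-point gain in percentile rank corresponds to different
# distances on the z-score scale depending on where it occurs.
gap_middle = percentile_to_z(60) - percentile_to_z(50)
gap_upper = percentile_to_z(80) - percentile_to_z(70)
print(f"50th to 60th percentile: {gap_middle:.3f} SD")
print(f"70th to 80th percentile: {gap_upper:.3f} SD")
```

Under the normality assumption, the gain from the 70th to the 80th percentile spans more of the z-score scale than the gain from the 50th to the 60th, which is the unevenness that makes raw percentile ranks a poor outcome metric.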