Technical Methods Report: Using State Tests in Education Experiments

NCEE 2009-013
November 2009

Appendix B: How NCEE-Funded Evaluations Use State Test Data

As noted, the appeal and ease of using state assessment data for evaluation purposes have grown in recent years. To provide a richer sense of the types of evaluations that use state assessment data and how rigorous evaluations may use such data, we gathered information about studies funded by the National Center for Education Evaluation and Regional Assistance (NCEE) that use state assessment data. We also examined the reasons why research teams viewed state assessments as an appropriate source of outcome data for their studies and the issues they encountered or anticipated in using such data.

1. Which NCEE-Funded Evaluations Use State Data?
To identify a set of rigorous studies that have used or plan to use state assessments, we reviewed study descriptions, unpublished design documents, and published reports (when available) for NCEE-funded evaluations begun or completed during the past five years. Our review included both studies sponsored by NCEE's evaluation division (34 in total) and randomized controlled trials (RCTs) being conducted by the NCEE-funded Regional Educational Laboratories (RELs) (24 in total). Notably, our review did not include the investigator-initiated studies funded through Institute of Education Sciences (IES) research grants, which likely include additional examples of rigorous evaluations making use of state assessment data. Gathering information about such studies would have required obtaining unpublished information from dozens of principal investigators, so it was deemed beyond the scope of our review.

Among the 58 studies described above, we identified 21 that planned to use or have used state assessments as a source of outcome measures. These included 12 REL-initiated RCTs and 9 NCEE-sponsored evaluations. Table B.1 (provided at the end of this appendix) provides basic descriptions of these studies.

The studies identified evaluate a diverse set of interventions. These range from system-wide educational reforms—for example, Charter Schools (1) and Success in Sight (8)—to broad-based or subject-specific professional development for teachers—for example, Teacher Induction (3), Pacific CHILD (5), and Early Reading PD (6)—to subject-specific curricula or instructional programs—for example, Virtual Algebra (9) and AMSTI (14)—to supplementary instructional approaches—for example, After-School EAI (2), MAP Assessment (11), and CASL (13). An important common denominator across many of these evaluations is a focus on improving students' academic achievement broadly defined and/or students' general ability to meet academic standards.

Importantly, many of the aforementioned studies are ongoing and, hence, their designs and other details may change by the time the studies are completed and results are published. The main purpose of our review, however, was to help anchor the discussion presented in the main body of this report on issues related to the use of state tests to which research teams had called attention. The examples presented in this appendix are illustrative only. They do not describe the use of state tests in education experiments or the perspectives of education researchers about state test data in a representative way. Consistent with the purposes of our review, we therefore omit the identities of, and details about, individual studies from the discussion that follows.

2. What State Data Are Collected?
Many studies use assessment data from multiple states. Nine of the identified studies were being conducted in, and therefore collected test data from, only a single state. Another nine studies had collected or planned to collect data in multiple states, which suggests that considerations about whether it is appropriate to combine impact estimates based on distinct state assessments, and how best to do so, are relevant for many studies. One of the reviewed studies planned to collect test data from 16 states, and two other studies had collected or planned to collect test data from 13 states each. States in which NCEE-funded studies would be collecting test data included Alabama, Arizona, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Illinois, Indiana, Kansas, Kentucky, Louisiana, Massachusetts, Maryland, Maine, Michigan, Minnesota, Missouri, New Jersey, New Mexico, New York, North Carolina, Ohio, Oregon, Pennsylvania, Texas, Utah, Vermont, and Wisconsin.

Not surprisingly, studies examine intervention effects on grades commonly tested. Consistent with No Child Left Behind (NCLB) requirements and state assessment policies, most studies used state test data to examine intervention outcomes in those grades in which students must be tested yearly (that is, grades three to eight). Some studies included second grade students in their research samples but often excluded these students from analyses of achievement impacts, because state assessment data are unavailable for that grade. (These students and/or classrooms are nevertheless included in other study components—for example, analyses of intervention impacts on teacher practices.) Several studies examined the impacts of an intervention tested across multiple grades. Researchers must therefore also consider whether to aggregate impact estimates based on the distinct assessments administered to students in different grades, whether within a given state or across multiple states.

Studies also tend to examine impacts on commonly tested subjects. Most studies planned to collect students’ test scores in locally or state-administered mathematics and/or reading assessments, reflecting the focus of the intervention being evaluated. Some studies would examine impacts on both reading and mathematics test scores; however, the analyses of these test scores are generally distinct. Reflecting the gradual expansion of state testing to additional subjects, one study planned to collect student scores on the state’s science test, in addition to scores on the state tests in reading and mathematics.

Several studies collect state test scores in addition to administering their own assessments. This could make it possible to examine whether impact estimates based on state assessment data are consistent with results based on a common study-administered assessment.

3. How Do Studies Use State Assessment Data?
In reviewing how NCEE-funded studies have used or plan to use state assessment data, we focused on two main issues: (1) the types of scores examined; and (2) how impact estimates were computed and, if applicable, aggregated across grades and/or states.

Studies generally examine overall achievement scores in a given subject area. These can be reported in several different metrics including scale scores, normal curve equivalents, and percentile rankings. Some studies planned to estimate impacts both on scale scores (or other continuous measures of achievement) and on proficiency rates. One study estimated impacts on scale scores and on the percentage of students achieving above or below the districts’ pre-intervention average reading test score. A few studies planned to estimate impacts on subscales of achievement.

Impacts are commonly estimated in effect size units. Because test scores across grades and/or states rarely share a common scale, most research teams planned to standardize test scores to have a common mean of zero and a standard deviation of one (that is, convert them to z-scores). Such standardization would enable researchers to describe impact estimates as effect sizes, which facilitates the comparison or aggregation of impact estimates based on distinct assessments. The scale scores (or other continuous measures of achievement examined) for students in a given grade and state are converted to z-scores by subtracting the mean score for that grade/test and dividing this difference (or deviation from the mean) by the standard deviation of scores for that grade/test.
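The conversion described above can be sketched in a few lines of Python. The scores, grade, and function below are illustrative only and are not drawn from any of the reviewed studies.

```python
def to_z_scores(scores, mean, sd):
    """Convert scale scores to z-scores: subtract the grade/test mean,
    then divide each deviation by the grade/test standard deviation."""
    return [(s - mean) / sd for s in scores]

# Hypothetical grade 4 scale scores on one state's reading test
grade4_scores = [610.0, 645.0, 598.0, 672.0, 630.0, 655.0]

# Mean and standard deviation for that grade/test; here they are
# computed from the sample itself (one possible standardization choice)
mean = sum(grade4_scores) / len(grade4_scores)
sd = (sum((s - mean) ** 2 for s in grade4_scores)
      / (len(grade4_scores) - 1)) ** 0.5

# Standardized scores have mean zero and standard deviation one
z_scores = to_z_scores(grade4_scores, mean, sd)
```

With sample-based parameters, the resulting z-scores are centered on zero by construction; a treatment-control difference in these units is then directly interpretable as an effect size.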

Studies use different standardization strategies. To convert scale scores into z-scores, some studies used sample-based estimates of the means and standard deviations for students taking a given assessment in a given grade. Other studies used the means and standard deviations that states report for the overall student population. Such decisions are important because they influence the precision of impact estimates (sample-based parameter estimates are typically less precise because they rest on smaller samples). They also determine how impact estimates are interpreted: relative to the distribution of achievement for students or schools similar to those included in the study, or relative to the distribution of achievement for the broader, statewide student population.
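To illustrate why this choice matters for interpretation, the sketch below shows how the same impact in scale-score points translates into different effect sizes depending on which standard deviation is used. All numbers are invented for illustration; in particular, the assumption that the study sample is less variable than the statewide population is hypothetical, though it is plausible when a study recruits similar, often low-performing, schools.

```python
# Hypothetical treatment-control difference on a state test's scale score
impact_scale_points = 8.0

# A study sample of similar schools may vary less than the statewide
# population, so its standard deviation can be smaller (assumed here)
sd_study_sample = 25.0   # sample-based estimate
sd_statewide = 35.0      # state-reported population value

# The same raw impact yields a larger effect size when standardized
# against the narrower study-sample distribution
es_sample = impact_scale_points / sd_study_sample     # 0.32
es_statewide = impact_scale_points / sd_statewide     # about 0.23
```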

Many studies aggregate effect size impact estimates across states and/or grades. One study conducted all treatment-control comparisons using z-scores for students taking the same tests (that is, within the same grade and district) to ensure that treatment status was not confounded with properties of the test(s). Impact estimates were then aggregated across districts and grades to generate overall estimates for the intervention. Two other studies planned to treat individual states as separate samples, estimate impacts within each state, and then combine the separate impact estimates across states for an estimate of the overall effect of the intervention. Notably, for studies still underway, study design and analysis documents did not always specify how effect size estimates would be combined (that is, as simple or weighted averages).
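The two combining rules mentioned above (simple versus weighted averages) can be sketched as follows. The state-level effect size estimates and standard errors are hypothetical, and the inverse-variance weighting shown is one common choice, not necessarily the one any reviewed study adopted.

```python
# Hypothetical effect size estimates and standard errors for three states
state_effects = [0.15, 0.05, 0.22]
state_ses = [0.06, 0.10, 0.08]

# Simple average: every state counts equally
simple_average = sum(state_effects) / len(state_effects)

# Precision-weighted (inverse-variance) average: states with more
# precise estimates (smaller standard errors) receive more weight
weights = [1.0 / se ** 2 for se in state_ses]
weighted_average = (
    sum(w * e for w, e in zip(weights, state_effects)) / sum(weights)
)
```

In this invented example the weighted average exceeds the simple average because the least precise estimate happens also to be the smallest; in general the two rules can diverge in either direction, which is why design documents should state which is used.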

A few studies do not aggregate effect size estimates, reflecting unique design features. For instance, one study planned to conduct analyses separately for grades four and five because participating schools were randomly assigned to test the intervention in one grade or the other. Another study, examining state test scores for students across several consecutive grades, planned analyses focusing on each grade separately, since students in these grades could have had different amounts of exposure to the intervention being evaluated.

Other studies examine impacts using vertically scaled assessments. The availability of vertically equated scores was expected to enable one study team to analyze together the scores of students in two consecutive grades and estimate the intervention’s effects on academic performance for both grades combined. Another study planned to estimate impacts as (yearly) deviations from the average trajectory of learning for students across five grades.

4. Why Did Studies View State Tests as an Appropriate Source of Outcome Measures?
Although information was not uniformly available, study documents sometimes included statements about the reasons why research teams chose to use state tests in their studies. Research teams cited the following reasons.

Potential for Greater Policy Relevance. One study team noted that, while district-administered test scores may not cover every relevant domain of student achievement, they captured the content that schools deem most important or worthy of assessing. Documents for another study indicated that the study would be estimating impacts using state data because educational authorities care about student performance in these high-stakes tests. A third study noted that using state assessments would enable researchers to estimate the extent to which program implementation influences student achievement relative to NCLB goals. Researchers for a fourth study described the state assessment as a policy-relevant achievement measure.

Minimize Test Burden. One study team anticipated that relying on locally administered tests would help overcome likely resistance to additional student testing by participating entities, as well as reduce evaluation costs. Another study team viewed using state test data as facilitating school recruitment, because no additional test administrations would be required of participating schools or students.

Possible Greater Reliability of Achievement Measures. In addition to administering a brief, common assessment to all study participants, which was the main source of outcome measures, one study team also collected state test scores. This team anticipated that the state-administered tests were more likely to be “full battery” and, therefore, might measure achievement more reliably.

Alignment with the Intervention. One study team indicated that state standards mandate the teaching of concepts that represented the focus of the intervention being evaluated. The state tests, in turn, were expected to be aligned with the required curriculum content and, therefore, with the intervention being evaluated.

5. What Challenges Were Anticipated in Using State Test Data?
Study documents identified several important challenges related to using state tests. Although these do not represent an exhaustive inventory, they provide a sense of the issues that researchers worry about when using state tests for evaluation purposes. Challenges mentioned explicitly in study documents included the following.

Estimation and Interpretation of Aggregated Impact Estimates. The fact that locally administered tests vary in their scales as well as in the subjects and content covered poses important evaluation challenges. For example, one study estimated treatment-control differences “within grade and district”—that is, test scores were standardized to describe student performance relative to other students in the same grade and district—in order to provide a common standard for treatment and control groups across all study sites. Another study team noted that using proficiency percentages and standardized scores would yield common outcomes but would not make the meaning of the measures uniform across states; hence, the pooled impact of the intervention should be interpreted as the average impact of the intervention on the skills measured by the individual state assessments. A third study team noted that effect size impact estimates need to be interpreted in terms of the variance in scores on each state assessment.

Possible Floor Effects. Especially for interventions targeting English language learner (ELL) students, researchers sometimes expressed concern about the ability of standardized tests to capture improvements in student outcomes. One study team, for example, expressed concern that some study participants might "bottom out" on the state tests, making it difficult to record meaningful learning gains. For this reason, the study planned to examine student performance over multiple years and according to students' starting achievement levels. Another study team expressed concerns that the state ELA test might be subject to floor effects in study locations where most students are not native speakers of English and/or classroom use of English is limited. For this reason, the study team planned to collect data on, and estimate impacts using, both state-administered ELA and English as a second language (ESL) assessments. Similar concerns applied to other studies focused on low-performing students.

Insufficient Alignment with the Intervention. One study team hypothesized that misalignment could help account for the absence of an impact on student achievement, although the study could not test this hypothesis directly since scores on an alternative, more closely aligned test were not available. Another study team ultimately determined that the state test was not as well aligned with the intervention as originally thought, because it was designed to cover the broad set of skills covered in the state’s curriculum, while the intervention focuses on particular concepts. The limited coverage of these concepts in the state’s standardized test would reduce the study’s ability to detect impacts.

Missing Data. Another key concern in collecting and analyzing state test data is the potential for different participation rates among treatment and control group members, especially if exposure to the intervention influences which students participate in standardized testing. Such concerns were raised by two study teams. Study documents suggested that investigators planned to look for evidence of differential participation rates as a possible source of bias in impact estimates, but did not specify how potential biases might be addressed.


34 See for links to descriptions of ongoing NCEE research projects.