- Introduction
- Whether to Use State Tests in Education Experiments
- How to Use State Test Data in Education Experiments
- Whether to Secure Baseline Data
- How to Use Baseline Measures
- Analysis of Scale, Proficiency Level, or Other Test Scores
- Combining Results Across Tests for Different Grades or States

- Conclusions and Recommendations
- References
- Appendix A: State Testing Programs Under NCLB
- Appendix B: How NCEE-Funded Evaluations Use State Test Data
- List of Tables

RCT evaluations often involve multiple grades or multiple states (see Appendix B). In such instances, researchers must consider whether to combine results across grades or states and, if so, determine the best methods for doing so. In some studies, combining results across grades or states might be necessary to achieve sufficient statistical power. In other studies, sample sizes might be sufficient to produce impact estimates separately by grade and state, but an overall estimate might also be desired. In either case, the decision of whether to combine results should be made during the planning stages of the research design and analysis plan for estimating program impacts.

Combining results across grades might be an important goal for RCTs in which the intervention is designed to have broad impacts on performance across multiple grades or throughout an entire school or district. Combining results across states might be an important goal when the intervention is intended to have consistent impacts across states and an overall estimate of program impacts across the set of states participating in the study is desired. Whatever the circumstance, researchers must carefully consider differences in state tests and whether combining results across grades and states is appropriate. In this section, we present several methods for combining results across grades or states, and we discuss when these methods are appropriate and when their use might be ill-advised.

*Deciding Whether to Combine Impact Estimates.* Again, in our view, the decision of
whether to combine results across grades or states should be driven, first and foremost,
by the research questions underlying the evaluation. These can be classified into
two categories addressing different goals. First, if the goal of the research is
to demonstrate that an intervention has an effect on students' abilities to meet
state standards, then differences among the standards and assessments of different
grades and states simply reflect intended variation in the targeted outcome (that
is, proficiency on the standards). Arguably, a program that purports to have an
impact on the broad set of knowledge and skills encompassed in state standards should
be able to produce impacts on any state test, in any grade, and policymakers would
want to know about these broad impacts.

Alternatively, if the goal of the research is to demonstrate that an intervention
has an effect on specific skills or knowledge, then combining results across grades
or states might be inappropriate unless each test provides valid, reliable, and
comparable data on the targeted outcomes. More specifically, it is important to
consider whether the standards and assessments are sufficiently similar across grades
and states to support combining results.^{23} If the tests differ substantially in
terms of the knowledge and skills assessed or the difficulty of test items, estimated
program impacts for one grade or in one state might be systematically different
from estimated impacts in another grade or state. From a conservative viewpoint,
differences among tests and the lack of formal psychometric equating may preclude
the rescaling approaches described in this report. Under this perspective, results
should never be combined across grades or states unless the tests can be formally
equated using common items or common populations. Even when an argument can be made
to support pooling data from different state tests, aggregating to produce an overall
impact estimate could mask important variation in measured effects. Therefore, if
there are concerns about such substantive differences in tests, and if statistical
power allows, one should compare, across grades and/or states, how well each
test provides valid and reliable data on the impacts of the intervention on the
targeted skills. Furthermore, a lack of alignment between a state test and targeted
outcomes might be reason to exclude that state from the study or administer a different
assessment.

Whether the data are similar enough to support combining results is subjective at
this point, although studies with sample sizes large enough to produce separate
estimates for each grade and state might shed light on the extent to which different
standards and assessments are likely to moderate program impacts. In fact, because
we know so little at this point about the influence of differences in state assessments
on possible variations in estimated impacts on targeted outcomes, it might be argued
that research focused on specific skills should only be conducted using an external
assessment or, when state tests are used, with sufficient sample sizes to estimate
impacts separately for each grade and state. For studies that involve a
large number of grades and/or states (for example, four states with three grades
each), moderator analyses might be conducted to explore how the alignment (or lack
thereof) between the state assessments and the intervention influences impact estimates.^{24}

*Deciding How to Combine Impact Estimates.* If a decision is made to combine results
across grades or states, there are several analytic approaches to doing so. In our
discussion of these methods, we first focus on strategies for analysis of student-level
data rescaled to a common metric. This is generally the most powerful and efficient
way to combine results across grades and states; however, it also requires strong
assumptions about the comparability of assessments and study samples across grades
and/or states that should be tested explicitly. Following this, we discuss strategies
for combining grade-specific and state-specific impact estimates within a meta-analytic
framework. Generally speaking, meta-analytic strategies treat impact estimates based
on different state tests as generated from separate, independent studies that jointly
provide a distribution of treatment effects. While computationally more intensive,
a strength of meta-analytic approaches is that they can explicitly test the tenability
of assumptions necessary to generate estimates of average treatment effects.

*Rescaling Individual-Level Scores.* The simplest method for producing combined impact
estimates involves running multigrade or multistate analyses after converting test
scores in each grade and state to z-scores (or some other common scale). However,
this approach also imposes the most stringent assumptions on the data. Implied in
this z-score approach are two key assumptions. First is the assumption that differences
in the knowledge and skills tested by the different assessments are inconsequential
in the context of the particular evaluation. That is, either the tests measure the
same content, or differences in content are accepted as reflecting intended variation
in state standards and the desired impact estimate is one which is pooled across
states, despite variation in standards. Second is the assumption that differences
among the tests consist primarily of differences in the scale of the test scores.
In other words, it is assumed (1) that the study sample from each grade and state
represents a similar cross-section of the population of students targeted by the
intervention and (2) that the underlying distributions of scores from each test
are identical, except for differences in the means and standard deviations of the
scale scores.
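
To make this concrete, the following sketch (in Python, with hypothetical scores; the state names, scale values, and the choice to standardize on the combined treatment-and-control sample within each cell are all assumptions for illustration) converts scores from two differently scaled state tests to z-scores within each grade-by-state cell before pooling:

```python
from statistics import mean, stdev

def to_z_scores(scores):
    """Standardize a list of scores to mean 0, SD 1 within one grade/state cell."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

# Hypothetical scale scores from two state tests that use different scales.
cells = {
    ("StateA", 4): {"treatment": [410, 455, 430, 470], "control": [400, 420, 415, 445]},
    ("StateB", 4): {"treatment": [1520, 1610, 1575, 1640], "control": [1500, 1545, 1530, 1580]},
}

pooled_t, pooled_c = [], []
for cell in cells.values():
    # Standardize within the cell, then split back into treatment and control.
    z = to_z_scores(cell["treatment"] + cell["control"])
    n_t = len(cell["treatment"])
    pooled_t.extend(z[:n_t])
    pooled_c.extend(z[n_t:])

# Pooled impact, expressed in within-cell standard deviation units.
impact = mean(pooled_t) - mean(pooled_c)
```

Because each cell is forced to mean 0 and SD 1, this pooling is only defensible under the two assumptions just described.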

The plausibility of the assumption that the study samples represent similar cross-sections of the target population can be evaluated by comparing the demographic characteristics of samples from different grades and states, and also by comparing the means and variances of pretreatment test scores from each grade and state to the respective statewide means and variances of test scores for each grade. For example, relative to the statewide distribution of scores, study samples from different grades and states might consistently show an average baseline test score that is one standard deviation below the state average and a variance that is one-half the magnitude of the statewide variance for that grade.

The simple approach of converting to z-scores separately for each grade and state
can also be adapted to suit conditions in which study samples are heterogeneous,
by standardizing using the statewide means and standard deviations instead of the
sample means and standard deviations.^{25} If the comparisons described above revealed
that the achievement of the study sample was not comparable across grades or states,
the statewide means and standard deviations for each grade could be used to rescale
test scores by grade and state into a cross-state comparable z-score metric by subtracting
the state mean and dividing by the state standard deviation for each grade and state.
For example, the resultant z-scores for a study involving fourth graders from two
states might then have a smaller range in one state, accurately reflecting the more
homogeneous sample from that state.
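
A minimal sketch of this statewide rescaling follows; the statewide norms below are invented for illustration, not real state data:

```python
from statistics import mean, stdev

def rescale_to_state_z(score, state_mean, state_sd):
    """Rescale a raw scale score relative to the statewide distribution
    for its grade, yielding a cross-state comparable z-score."""
    return (score - state_mean) / state_sd

# Hypothetical statewide means and SDs, keyed by (state, grade).
norms = {("StateA", 4): (450.0, 40.0), ("StateB", 4): (1550.0, 75.0)}

# A relatively disadvantaged fourth-grade study sample from State A.
sample_a = [410, 455, 430, 470]
z_a = [rescale_to_state_z(x, *norms[("StateA", 4)]) for x in sample_a]
# A disadvantaged sample shows a negative mean and an SD below 1,
# correctly reflecting its position within the statewide distribution.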

This method of using state-level means and standard deviations to compare performance and reflect heterogeneity in performance imposes the additional assumption that statewide variance in achievement would be similar within each grade and state if the same vertically scaled test were used for each grade and state. This is because the statewide means and standard deviations are used to rescale the test scores relative to the statewide distribution of achievement. For example, if the study samples from each grade and state were representative of the statewide populations for each grade and state, then the resulting rescaled scores would have a mean of zero and a standard deviation of one for every grade and state. This result implies that any differences in the means or standard deviations across grades and states are artifacts of the tests, and can be suitably removed by rescaling.

Although it is impossible to fully evaluate this assumption, results from the Florida Comprehensive Assessment Test (FCAT) vertical developmental scale suggest that variance in math and reading scores is fairly consistent across narrow grade spans (for example, no more than three grade levels), and that variance in math and reading scores may decrease substantially over larger grade spans (for example, the variance in tenth-grade math scores is less than half the variance of third-grade math scores) (Coxe 2002). Similar patterns can be seen in the vertical scale of the Stanford-9 reading and math tests (Harcourt 1997).

Regarding variation for a single grade across multiple states, data from the National
Assessment of Educational Progress (NAEP) suggest that variation in student achievement
for several subjects in fourth or eighth grade is fairly consistent across groups
or clusters of states (National Center for Education Statistics 2008). Although
there are small but significant differences in within-state achievement variation,
the variance estimate for any one state is typically not significantly different
from the variance estimates for more than half of the states in the nation. This
consistency in variation across grades and states suggests that combining impact
estimates across grades and states in RCTs might be reasonable, so long as the grade
span is not wide (for example, no more than three or perhaps four grades) and so
long as the states included in the study have similar within-state variability on
the NAEP tests.^{26}

In addition to assumptions about consistency in means and variances of student performance,
a second set of assumptions requires that the shape of the distributions of achievement
scores be similar across grades and states. The plausibility of this second assumption
can be evaluated by comparing the shapes of the distributions of pretest scores
from each grade and state through graphical displays (for example, boxplots, histograms,
normal-quantile plots). If the distributions of pretest scores appear similar, then
the simple linear transformation to z-scores might be sufficient.^{27}
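
As a rough numeric complement to these graphical checks (the decile comparison below is one of many possible summaries, not a formal test), one might compare standardized decile profiles across grades and states:

```python
from statistics import mean, stdev, quantiles

def shape_profile(scores, n=10):
    """Decile cut points of within-cell standardized scores; similar
    profiles across grades/states suggest similarly shaped distributions."""
    m, s = mean(scores), stdev(scores)
    return quantiles([(x - m) / s for x in scores], n=n)

def profile_gap(scores_a, scores_b):
    """Maximum absolute difference between two decile profiles,
    a crude index of how much the distribution shapes differ."""
    return max(abs(a - b)
               for a, b in zip(shape_profile(scores_a), shape_profile(scores_b)))
```

A small gap supports the simple linear transformation; a large gap points toward the nonlinear approach discussed next.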

If the second assumption is violated, and differences in the distributions cannot be attributed to differences in the samples of students (that is, the target population is similarly represented in each grade and state), then a nonlinear transformation of test scores might be a more appropriate option. The most common nonlinear transformation used to link test scores is called equipercentile equating or linking (see Kolen and Brennan 2004).

In its most basic form, the equipercentile equating approach involves first converting test scores to percentile ranks in each grade and state. The percentile ranks are then converted to z-scores by substituting the value of a z-score from the standard normal distribution associated with each percentile rank. As with the linear transformation, the equipercentile approach assumes that differences in content tested are negligible (that is, for impacts on specific skills) or attributable to intentional variation in state standards (that is, for impacts on standards proficiency). Unlike the linear transformation, the equipercentile approach is able to remove differences in the shapes of the distributions of test scores from different states. Implied in this process is the assumption that distribution differences are due to differences in the concentrations of easy and hard items on each state's test, and not due to actual differences in the distributions of student achievement across states.
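
In a minimal form, the two-step conversion might look like the sketch below (a simplification only; operational equipercentile linking works with smoothed score distributions, as described in Kolen and Brennan 2004):

```python
from statistics import NormalDist

def equipercentile_z(scores):
    """Map each score to its mid-rank percentile within its grade/state,
    then to the corresponding standard normal deviate."""
    n = len(scores)
    ranked = sorted(scores)
    nd = NormalDist()  # standard normal distribution
    z = []
    for x in scores:
        # Mid-rank percentile: (count below + half of ties) / n,
        # which keeps every percentile strictly inside (0, 1).
        below = sum(1 for y in ranked if y < x)
        ties = sum(1 for y in ranked if y == x)
        p = (below + 0.5 * ties) / n
        z.append(nd.inv_cdf(p))
    return z
```

Because every state's scores are forced onto the same normal shape, any genuine cross-state differences in achievement distributions are removed along with the test artifacts, which is exactly the assumption flagged above.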

*The Importance of a Consistent Reference Population.* The objective of any rescaling
is to place the test scores on a common metric and ensure that the interpretation
of impact estimates is comparable across grades and states. This requires that effect
estimates reflect treatment-control differences not only in a common scale but also
for a common reference population (Dong, Maynard, and Perez-Johnson 2008; Lipsey
and Wilson 2001; Cooper 1998; Hedges and Olkin 1985). For example, a Cohen's d standardized
mean difference (Lipsey 1990; Cooper 1998; Cooper and Hedges 1994; Cohen 1988),
the most common standardized effect size, is highly sensitive to the reference population
whose standard deviation provides the scale for this statistic. In a study in which
the target population is consistently represented across grades and states, converting
to z-scores within grades and states sets the sample standard deviation to 1.0 in
each grade and state, yielding impact estimates in units equal to the estimate of
the within-grade standard deviation of the target population.

If the representation of the study sample varies across grades or states (for example, study participants might be relatively disadvantaged in one state and more broadly representative of the overall student population in another state), and statewide means and standard deviations are used to rescale individual scores to a comparable metric, the resultant differences in the standard deviations of scores in each grade or state reflect differences in representation of the target population. When calculating an overall standardized effect size, the standard deviation of the sample that best represents the target population^{28} might be used as a divisor to
convert the unstandardized effect size estimate into a rescaled effect size for
the specific population of students targeted by the intervention. This would produce
an effect size in units equal to the standard deviation of the target population.
In any case, without effective rescaling of the individual test scores or calculation
of comparable standardized effect estimates separately for each grade and state,
combining results across grades and states might produce misleading results.

It is therefore important to determine the extent to which individual state samples represent the overall population targeted by the intervention, and when justified, to rescale scores so that a consistent estimate of outcome variability is used to standardize the impact estimate in different states. To rescale scores from distinct assessments and make them directly comparable, evaluation researchers have two main alternatives from which to choose—using the mean and standard deviation of the sample control group versus using the mean and standard deviation of the population of students in the state. In cases in which there is not sufficient comparability of the study sample across states, we recommend using the state distribution. This rescales each student's score to represent his or her performance relative to other students statewide. Because most RCTs involve subpopulations of students instead of a statewide target population, the standard deviation of the rescaled scores will be less than one. Thus, the treatment-control mean difference must still be divided by the standard deviation of the control group (or another estimate of the standard deviation of achievement in the target population) in order to produce a traditional standardized effect size for the target population.
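
The recommended two-step calculation (rescale to statewide z-scores, then standardize the impact by the control-group standard deviation) can be sketched as follows; the score values are hypothetical:

```python
from statistics import mean, stdev

def standardized_effect(treat_z, control_z):
    """Treatment-control difference in statewide z-score units, divided by
    the control-group SD so the effect is expressed in units of the
    target population's standard deviation."""
    return (mean(treat_z) - mean(control_z)) / stdev(control_z)

# Hypothetical statewide z-scores for a disadvantaged study sample;
# note the within-sample SD is well below 1, as the text anticipates.
treat = [0.2, 0.4, 0.3, 0.5]
control = [0.0, 0.2, 0.1, 0.3]
effect = standardized_effect(treat, control)
```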

*Combining Impact Estimates Using Meta-Analysis.* Although any method for combining
results across multiple grades or states can be thought of as a variant of meta-analytic
methods, this section focuses on traditional meta-analytic models used to combine
separate impact estimates (Glass, McGraw, and Smith 1981; Cooper and Hedges 1994).
Because effect estimates in meta-analytic models must be on the same scale (for
example, standardized mean differences) it is very likely that the researchers will
need to use a linear or nonlinear linking technique described above to rescale the
test scores or the impact estimates prior to analyses. If linear or nonlinear transformations
of individual test scores cannot be implemented due to differences in the study
samples across grades or states, separate impact analyses should be conducted for
each grade and state, with combined effect size estimates produced only when scores
from different states can be rescaled relative to state-level means and variances
as described above.

Although similar in some respects to pooling individual test scores after rescaling to z-scores or another common metric, a meta-analysis is theoretically different in that data from different state tests are not treated as though they produce equated scores that can be pooled in a traditional analysis. Instead, a meta-analysis provides an estimate of the distribution of treatment effects from different studies. Variation in treatment effects is expected across grades, states, or both, and this variation may be explained by contextual variables that reflect differences in the tests or in study contexts. When the study design and analysis results support the notion that the variation in effect sizes is (a) due to random sampling variation, (b) adequately explained by contextual measures, or (c) ignorable based on the need for an impact estimate that is pooled across different sets of state standards, then an average treatment effect may be produced along with the standard error of the estimate. Otherwise, if the separate impact estimates cannot be pooled, then the distribution of effects may be presented without an average treatment effect. It is because of this ability to explicitly evaluate assumptions that we recommend a meta-analytic approach as opposed to indiscriminately pooling rescaled scores whenever cross-state or cross-grade average impact estimates are sought.

In the classical meta-analytic approach, separate effect estimates are produced
for each grade and state and then combined using weighted average effect estimates
or meta-regression models (including HLM meta-analytic models).^{29} In the weighted
average estimate approach, each grade or state is treated as a separate substudy
with a separate analysis and associated impact estimates. A second stage of analysis
combines the grade-specific or state-specific estimates by averaging. Typically,
the impact estimates from each grade and state are weighted by their associated
sample size or by the inverse of their standard error of estimate.
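
The inverse-variance version of this weighted average can be sketched as follows (the effect estimates and standard errors below are hypothetical):

```python
def fixed_effect_average(estimates, std_errors):
    """Inverse-variance weighted average of grade- or state-specific impact
    estimates, with the standard error of the combined estimate."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    combined = sum(w * d for w, d in zip(weights, estimates)) / total
    return combined, (1.0 / total) ** 0.5

# Two hypothetical state-specific effect sizes with equal standard errors.
avg, se = fixed_effect_average([0.20, 0.40], [0.10, 0.10])
```

Weighting by the inverse of the squared standard error gives more precise substudies more influence and yields the minimum-variance combined estimate under the fixed effects assumption discussed below.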

Alternatively, a meta-regression model could be used to produce a combined effect size. These meta-regression models can be categorized into (a) those that use two stages of analysis in which grade- and state-specific impact estimates are produced in the first stage and then used as the dependent variable in the second stage; or (b) those that rely on a single statistical model in which student-level data are clustered by grade and state via random coefficients (for example, random slopes in HLM models) or fixed coefficients (for example, interactions in a regression model).

Method (a), the meta-regression using two stages, is useful only when there is a large number of impact estimates to be combined (at least 10 but preferably 30 or more) and moderator analyses are required (see footnote 24). Not surprisingly, although this method is often employed in meta-analytic and systematic reviews of literatures comprising many separate studies, it is unlikely to be appropriate for IES-funded RCTs given the relatively small number of grades and states in a single study.

Method (b), which employs a single regression model and uses fixed or random coefficients to distinguish impact estimates across grades or states, might be applicable to most multigrade or multistate RCTs. In this case, it is essential that the test scores from each grade and state be rescaled (as necessary) to a common scale (for example, grade-specific z-scores or adjusted z-scores based on statewide means and standard deviations) so that impact estimates from each grade and state are on the same scale.

In addition, whether to use fixed or random effects in a meta-analysis is a key consideration. Fixed effects meta-analytic models impose the assumption that a single true impact applies to every grade and state, and that differences in estimated impacts across grades and states are due purely to sampling variation rather than to systematic differences in impacts across states. This approach may employ interaction terms involving the grade, state, and treatment indicators to explicitly test whether treatment effects differ across grades and/or states. Alternatively, random effects meta-analytic models assume that treatment effects vary across grades and states, usually with the specific assumption that the estimated impacts are a sample drawn from a normally distributed population of impacts across contexts. Random effects meta-analytic models also allow one to test and model variance in impact estimates using moderator variables (see footnote 24).
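
One standard way to probe the fixed effects assumption and estimate between-site variance is Cochran's Q statistic together with the DerSimonian-Laird estimator; a sketch with hypothetical inputs (the grade/state estimates and standard errors are invented for illustration):

```python
def heterogeneity(estimates, std_errors):
    """Cochran's Q statistic and the DerSimonian-Laird estimate of
    between-site (here, between-grade/state) variance tau^2."""
    w = [1.0 / se ** 2 for se in std_errors]
    total = sum(w)
    avg = sum(wi * di for wi, di in zip(w, estimates)) / total
    # Q: weighted squared deviations of site estimates from the average.
    q = sum(wi * (di - avg) ** 2 for wi, di in zip(w, estimates))
    k = len(estimates)
    c = total - sum(wi ** 2 for wi in w) / total
    # Excess of Q over its df under homogeneity, rescaled; floored at 0.
    tau2 = max(0.0, (q - (k - 1)) / c)
    return q, tau2
```

A Q statistic well above its degrees of freedom (k - 1) casts doubt on the fixed effects assumption, while tau^2 near zero is consistent with it; with the very small k typical of these RCTs, both quantities are unstable, which is the limitation noted below.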

Another distinction between fixed and random effects models is that the former produces results that cannot be generalized beyond the sample of grades and states in the study, whereas the latter considers the grades and states in the study to be a sample from a larger population of grades and states to which the results might be generalized. In studies that include multiple grades and multiple states, it is possible to use fixed effects for grades (because generalization beyond the grades included in the study might not be a goal) and random effects for states (permitting generalization to states not included in the study).

It is also important to recognize that, although the assumptions underlying a random
effects analysis may be plausible in some studies, a limitation of the random effects
analysis is that it requires a relatively large number of grades or states to produce
stable estimates. While a fixed effects analysis can be run with as few as two grades
or states, maximum likelihood estimation of random effects models can become unstable
when the number of states is small (for example, fewer than 10).^{30} Given the routinely
small number of grades and states in IES-funded RCTs, it is likely that fixed effects
methods are most applicable for typical study designs; however, it is important
to recognize that use of fixed effects methods implies that the results will not
be generalized beyond the sample of grades and states in the study.

*Combining Effect Estimates Using Proficiency Scores.* The analysis of proficiency
category scores is even more complicated than the analysis of scaled scores in multistate
or multigrade studies. This is because state assessments have wide variation in
cut points and corresponding levels of difficulty for defining proficiency (NCES
2007). This suggests that the impact estimates for an intervention might depend
on the difficulty of the state test. On the other hand, one might invoke the same
argument posed at several points in this report—if the research questions focus
on the impacts of the intervention on students' proficiency rates, then differences
in the difficulty of the assessments could be ignored because definitions of proficiency
are set by state policy and are indicative of the natural variation in what it means
to be proficient. A less-extreme position would involve a "middle-ground" approach
in which state-specific impact estimates are produced, and variation in impacts
across states is modeled using random effects. In studies involving several states,
this variation may be analyzed to determine how differences in the difficulty of
the tests might moderate program impacts on proficiency rates.