Technical Methods Report: Using State Tests in Education Experiments

NCEE 2009-013
November 2009

Combining Results Across Tests for Different Grades or States

RCT evaluations often involve multiple grades or multiple states (see Appendix B). In such instances, researchers must consider whether to combine results across grades or states, and if so, they must determine the best methods for combining results. In some studies, it might be absolutely necessary to combine results across grades or states in order to achieve sufficient statistical power. In other studies, sample sizes might be sufficient to produce impact estimates separately by grade and state, but an overall estimate might also be desired. Whatever the case, the decision of whether or not to combine results should be made during the planning stages of the research design and analysis plan for estimating program impacts.

Combining results across grades might be an important goal for RCTs in which the intervention is designed to have broad impacts on performance across multiple grades or throughout an entire school or district. Combining results across states might be an important goal when the intervention is intended to have consistent impacts across states and an overall estimate of program impacts across the set of states participating in the study is desired. Whatever the circumstance, researchers must carefully consider differences in state tests and whether combining results across grades and states is appropriate. In this section, we present several methods for combining results across grades or states, and we discuss when these methods are appropriate and when their use might be ill-advised.

Deciding Whether to Combine Impact Estimates. Again, in our view, the decision of whether to combine results across grades or states should be driven, first and foremost, by the research questions underlying the evaluation. These can be classified into two categories addressing different goals. First, if the goal of the research is to demonstrate that an intervention has an effect on students' abilities to meet state standards, then differences among the standards and assessments of different grades and states simply reflect intended variation in the targeted outcome (that is, proficiency on the standards). Arguably, a program that purports to have an impact on the broad set of knowledge and skills encompassed in state standards should be able to produce impacts on any state test, in any grade, and policymakers would want to know about these broad impacts.

Alternatively, if the goal of the research is to demonstrate that an intervention has an effect on specific skills or knowledge, then combining results across grades or states might be inappropriate unless each test provides valid, reliable, and comparable data on the targeted outcomes. More specifically, it is important to consider whether the standards and assessments are sufficiently similar across grades and states to support combining results.23 If the tests differ substantially in terms of the knowledge and skills assessed or the difficulty of test items, estimated program impacts for one grade or in one state might be systematically different from estimated impacts in another grade or state. From a conservative viewpoint, differences among tests and the lack of formal psychometric equating may preclude the rescaling approaches described in this report. Under this perspective, results should never be combined across grades or states unless the tests can be formally equated using common items or common populations. Even when an argument can be made to support pooling data from different state tests, aggregating to produce an overall impact estimate could mask important variation in measured effects. Therefore, if there are concerns about such substantive differences in tests, and if statistical power allows, one should compare across grades and/or states the ability of each test to provide valid and reliable data on the impacts of the intervention on the targeted skills. Furthermore, a lack of alignment between a state test and targeted outcomes might be reason to exclude that state from the study or administer a different assessment.

Whether the data are similar enough to support combining results is subjective at this point, although studies with sample sizes large enough to produce separate estimates for each grade and state might shed light on the extent to which different standards and assessments are likely to moderate program impacts. In fact, because we know so little at this point about the influence of differences in state assessments on possible variations in estimated impacts on targeted outcomes, it might be argued that research focused on specific skills should only be conducted using an external assessment or, when state tests are used, with sufficient sample sizes to produce impact estimates separately for each grade and state. For studies that involve a large number of grades and/or states (for example, four states with three grades each), moderator analyses might be conducted to explore how the alignment (or lack thereof) between the state assessments and the intervention influences impact estimates.24

Deciding How to Combine Impact Estimates. If a decision is made to combine results across grades or states, there are several analytic approaches to doing so. In our discussion of these methods, we first focus on strategies for analysis of student-level data rescaled to a common metric. This is generally the most powerful and efficient way to combine results across grades and states; however, it also requires strong assumptions about the comparability of assessments and study samples across grades and/or states that should be tested explicitly. Following this, we discuss strategies for combining grade-specific and state-specific impact estimates within a meta-analytic framework. Generally speaking, meta-analytic strategies treat impact estimates based on different state tests as generated from separate, independent studies that jointly provide a distribution of treatment effects. While computationally more intensive, a strength of meta-analytic approaches is that they can explicitly test the tenability of assumptions necessary to generate estimates of average treatment effects.

Rescaling Individual-Level Scores. The simplest method for producing combined impact estimates involves running multigrade or multistate analyses after converting test scores in each grade and state to z-scores (or some other common scale). However, this approach also imposes the most stringent assumptions on the data. Implied in this z-score approach are two key assumptions. First is the assumption that differences in the knowledge and skills tested by the different assessments are inconsequential in the context of the particular evaluation. That is, either the tests measure the same content, or differences in content are accepted as reflecting intended variation in state standards and the desired impact estimate is one which is pooled across states, despite variation in standards. Second is the assumption that differences among the tests consist primarily of differences in the scale of the test scores. In other words, it is assumed (1) that the study sample from each grade and state represents a similar cross-section of the population of students targeted by the intervention and (2) that the underlying distributions of scores from each test are identical, except for differences in the means and standard deviations of the scale scores.

The plausibility of the first of these conditions, that each study sample represents a similar cross-section of the target population, can be evaluated by comparing the demographic characteristics of samples from different grades and states, and also by comparing the means and variances of pretreatment test scores from each grade and state to the respective statewide means and variances of test scores for each grade. For example, relative to the statewide distribution of scores, study samples from different grades and states might be found fairly consistently to have an average baseline test score that is one standard deviation below the state average and a variance that is one-half the magnitude of the statewide variance for that grade.
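The comparison described above might be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name baseline_comparison, the pretest scores, and the statewide mean and standard deviation are all hypothetical values chosen for the example.

```python
from statistics import mean, variance

def baseline_comparison(sample_scores, state_mean, state_sd):
    """Compare one grade-by-state sample to its statewide distribution.

    Returns the sample mean's distance from the state mean in statewide
    standard deviation units, and the ratio of sample variance to
    statewide variance.
    """
    gap = (mean(sample_scores) - state_mean) / state_sd
    ratio = variance(sample_scores) / state_sd ** 2
    return gap, ratio

# Hypothetical pretest scores and statewide parameters (illustrative only).
sample_scores = [300.0, 310.0, 320.0, 330.0, 340.0]
gap, ratio = baseline_comparison(sample_scores, state_mean=350.0, state_sd=30.0)
print(f"mean gap in statewide SD units: {gap:.2f}")
print(f"sample variance / statewide variance: {ratio:.2f}")
```

Repeating this calculation for every grade and state, and then checking whether the gaps and ratios are roughly consistent, is one way to judge whether the samples represent similar cross-sections of the target population.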

The simple approach of converting to z-scores separately for each grade and state can also be adapted to suit conditions in which study samples are heterogeneous, by standardizing using the statewide means and standard deviations instead of the sample means and standard deviations.25 If the comparisons described above revealed that the achievement of the study sample was not comparable across grades or states, the statewide means and standard deviations for each grade could be used to rescale test scores by grade and state into a cross-state comparable z-score metric by subtracting the state mean and dividing by the state standard deviation for each grade and state. For example, the resultant z-scores for a study involving fourth graders from two states might then have a smaller range in one state, accurately reflecting the more homogeneous sample from that state.
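The two linear rescaling options might be sketched as follows. The function names, scores, and statewide parameters are hypothetical and serve only to illustrate the arithmetic; in practice the statewide means and standard deviations would come from state assessment reports.

```python
from statistics import mean, stdev

def zscores_within_sample(scores):
    """Standardize using the study sample's own mean and standard deviation."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]

def zscores_statewide(scores, state_mean, state_sd):
    """Standardize using the statewide mean and standard deviation."""
    return [(x - state_mean) / state_sd for x in scores]

# Hypothetical fourth-grade scores from two states with different score scales.
state_a_scores = [310.0, 325.0, 340.0, 355.0, 370.0]
state_b_scores = [48.0, 50.0, 52.0, 54.0, 56.0]

# Within-sample z-scores force a mean of 0 and an SD of 1 in each state.
z_a_sample = zscores_within_sample(state_a_scores)

# Statewide z-scores preserve each sample's position and spread relative
# to its own state (statewide parameters below are assumed values).
z_a = zscores_statewide(state_a_scores, state_mean=350.0, state_sd=40.0)
z_b = zscores_statewide(state_b_scores, state_mean=60.0, state_sd=10.0)

print("State A range:", max(z_a) - min(z_a))
print("State B range:", max(z_b) - min(z_b))
```

In this illustration the State B sample is both lower-scoring and more homogeneous relative to its statewide distribution, and the statewide z-scores retain that difference, whereas within-sample z-scores would erase it.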

This method of using state-level means and standard deviations to compare performance and reflect heterogeneity in performance imposes the additional assumption that statewide variance in achievement would be similar within each grade and state if the same vertically scaled test were used for each grade and state. This is because the statewide means and standard deviations are used to rescale the test scores relative to the statewide distribution of achievement. For example, if the study samples from each grade and state were representative of the statewide populations for each grade and state, then the resulting rescaled scores would have a mean of zero and a standard deviation of one for every grade and state. This result implies that any differences in the means or standard deviations across grades and states are artifacts of the tests, and can be suitably removed by rescaling.

Although it is impossible to fully evaluate this assumption, results from the Florida Comprehensive Assessment Test (FCAT) vertical developmental scale suggest that variance in math and reading scores is fairly consistent across narrow grade spans (for example, no more than three grade levels), and that variance in math and reading scores may decrease substantially over larger grade spans (for example, the variance in tenth-grade math scores is less than half the variance of third-grade math scores) (Coxe 2002). Similar patterns can be seen in the vertical scale of the Stanford-9 reading and math tests (Harcourt 1997).

Regarding variation for a single grade across multiple states, data from the National Assessment of Educational Progress (NAEP) suggest that variation in student achievement for several subjects in fourth or eighth grade is fairly consistent across groups or clusters of states (National Center for Education Statistics 2008). Although there are small but significant differences in within-state achievement variation, the variance estimate for any one state is typically not significantly different from the variance estimates for more than half of the states in the nation. This consistency in variation across grades and states suggests that combining impact estimates across grades and states in RCTs might be reasonable, so long as the grade span is not wide (for example, no more than three or perhaps four grades) and so long as the states included in the study have similar within-state variability on the NAEP tests.26

In addition to assumptions about consistency in means and variances of student performance, a second set of assumptions requires that the shape of the distributions of achievement scores be similar across grades and states. The plausibility of this second assumption can be evaluated by comparing the shapes of the distributions of pretest scores from each grade and state through graphical displays (for example, boxplots, histograms, normal quantile plots). If the distributions of pretest scores appear similar, then the simple linear transformation to z-scores might be sufficient.27

If the second assumption is violated, and differences in the distributions cannot be attributed to differences in the samples of students (that is, the target population is similarly represented in each grade and state), then a nonlinear transformation of test scores might be a more appropriate option. The most common nonlinear transformation used to link test scores is called equipercentile equating or linking (see Kolen and Brennan 2004).

In its most basic form, the equipercentile equating approach involves first converting test scores to percentile ranks in each grade and state. The percentile ranks are then converted to z-scores by substituting the value of a z-score from the standard normal distribution associated with each percentile rank. As with the linear transformation, the equipercentile approach assumes that differences in content tested are negligible (that is, for impacts on specific skills) or attributable to intentional variation in state standards (that is, for impacts on standards proficiency). Unlike the linear transformation, the equipercentile approach is able to remove differences in the shapes of the distributions of test scores from different states. Implied in this process is the assumption that distribution differences are due to differences in the concentrations of easy and hard items on each state's test, and not due to actual differences in the distributions of student achievement across states.
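The basic percentile-rank-to-z-score conversion might be sketched as follows. Two details are assumptions of this sketch rather than part of the description above: percentile ranks are computed as mid-percentile ranks (the share of scores below plus half the share tied), which keeps every rank strictly between 0 and 1, and the ranks are computed within the study sample rather than against a statewide distribution. The scores are hypothetical.

```python
from statistics import NormalDist

def percentile_ranks(scores):
    """Mid-percentile rank of each score, expressed as a proportion in (0, 1)."""
    n = len(scores)
    ranks = []
    for x in scores:
        below = sum(1 for y in scores if y < x)
        tied = sum(1 for y in scores if y == x)  # includes the score itself
        ranks.append((below + 0.5 * tied) / n)
    return ranks

def normalized_zscores(scores):
    """Replace each score with the standard normal quantile of its percentile rank."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(p) for p in percentile_ranks(scores)]

# Hypothetical, strongly right-skewed scores from one grade and state.
skewed_scores = [10.0, 11.0, 12.0, 13.0, 40.0]
z = normalized_zscores(skewed_scores)
print([round(v, 3) for v in z])
```

Because only rank order is used, the outlying score of 40 is pulled in to the same distance from the center as the lowest score, which is how this transformation removes differences in distributional shape.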

The Importance of a Consistent Reference Population. The objective of any rescaling is to place the test scores on a common metric and ensure that the interpretation of impact estimates is comparable across grades and states. This requires that effect estimates reflect treatment-control differences not only in a common scale but also for a common reference population (Dong, Maynard, and Perez-Johnson 2008; Lipsey and Wilson 2001; Cooper 1998; Hedges and Olkin 1985). For example, a Cohen's d standardized mean difference (Lipsey 1990; Cooper 1998; Cooper and Hedges 1994; Cohen 1988), the most common standardized effect size, is highly sensitive to the reference population whose standard deviation provides the scale for this statistic. In a study in which the target population is consistently represented across grades and states, converting to z-scores within grades and states sets the sample standard deviation to 1.0 in each grade and state, yielding impact estimates in units equal to the estimate of the within-grade standard deviation of the target population.

If the representation of the study sample varies across grades or states (for example, study participants might be relatively disadvantaged in one state and more broadly representative of the overall student population in another state), and statewide means and standard deviations are used to rescale individual scores to a comparable metric, the resultant differences in the standard deviations of scores in each grade or state reflect differences in representation of the target population. When calculating an overall standardized effect size, the standard deviation of the sample that best represents the target population28 might be used as a divisor to convert the unstandardized effect size estimate into a rescaled effect size for the specific population of students targeted by the intervention. This would produce an effect size in units equal to the standard deviation of the target population. In any case, without effective rescaling of the individual test scores or calculation of comparable standardized effect estimates separately for each grade and state, combining results across grades and states might produce misleading results.

It is therefore important to determine the extent to which individual state samples represent the overall population targeted by the intervention, and when justified, to rescale scores so that a consistent estimate of outcome variability is used to standardize the impact estimate in different states. To rescale scores from distinct assessments and make them directly comparable, evaluation researchers have two main alternatives: using the mean and standard deviation of the sample control group, or using the mean and standard deviation of the population of students in the state. In cases in which there is not sufficient comparability of the study sample across states, we recommend using the state distribution. This rescales each student's score to represent his or her performance relative to other students statewide. Because most RCTs involve subpopulations of students instead of a statewide target population, the standard deviation of the rescaled scores will typically be less than one. Thus, the treatment-control mean difference must still be divided by the standard deviation of the control group (or another estimate of the standard deviation of achievement in the target population) in order to produce a traditional standardized effect size for the target population.
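The two-step calculation described above might be sketched as follows for a single grade and state. The function name and all scores and statewide parameters are hypothetical, and the sketch uses a simple unadjusted difference in means rather than the regression-adjusted impact estimate a real analysis would likely use.

```python
from statistics import mean, stdev

def standardized_effect(treat, control, state_mean, state_sd):
    """Rescale to statewide z-scores, then standardize by the control-group SD.

    Returns (difference in statewide SD units,
             control-group SD of the rescaled scores,
             effect size in target-population SD units).
    """
    z_treat = [(x - state_mean) / state_sd for x in treat]
    z_control = [(x - state_mean) / state_sd for x in control]
    diff = mean(z_treat) - mean(z_control)
    sd_control = stdev(z_control)  # usually below 1 for a subpopulation
    return diff, sd_control, diff / sd_control

# Hypothetical posttest scores for one grade in one state.
control = [300.0, 310.0, 320.0, 330.0, 340.0]
treat = [308.0, 318.0, 328.0, 338.0, 348.0]
diff, sd_control, effect_size = standardized_effect(
    treat, control, state_mean=350.0, state_sd=40.0
)
print(f"difference in statewide SD units: {diff:.3f}")
print(f"control-group SD of rescaled scores: {sd_control:.3f}")
print(f"effect size for the target population: {effect_size:.3f}")
```

Because the study sample here is more homogeneous than the state as a whole, the effect size in target-population units is larger than the same difference expressed in statewide standard deviation units.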

Combining Impact Estimates Using Meta-Analysis. Although any method for combining results across multiple grades or states can be thought of as a variant of meta-analytic methods, this section focuses on traditional meta-analytic models used to combine separate impact estimates (Glass, McGraw, and Smith 1981; Cooper and Hedges 1994). Because effect estimates in meta-analytic models must be on the same scale (for example, standardized mean differences) it is very likely that the researchers will need to use a linear or nonlinear linking technique described above to rescale the test scores or the impact estimates prior to analyses. If linear or nonlinear transformations of individual test scores cannot be implemented due to differences in the study samples across grades or states, separate impact analyses should be conducted for each grade and state, with combined effect size estimates produced only when scores from different states can be rescaled relative to state-level means and variances as described above.

Although similar in some respects to pooling individual test scores after rescaling to z-scores or another common metric, a meta-analysis is theoretically different in that data from different state tests are not treated as though they produce equated scores that can be pooled in a traditional analysis. Instead, a meta-analysis provides an estimate of the distribution of treatment effects from different studies. Variation in treatment effects is expected across grades, states, or both, and this variation may be explained by contextual variables that reflect differences in the tests or in study contexts. When the study design and analysis results support the notion that the variation in effect sizes is (a) due to random sampling variation, (b) adequately explained by contextual measures, or (c) ignorable based on the need for an impact estimate that is pooled across different sets of state standards, then an average treatment effect may be produced along with the standard error of the estimate. Otherwise, if the separate impact estimates cannot be pooled, then the distribution of effects may be presented without an average treatment effect. It is because of this ability to explicitly evaluate assumptions that we recommend a meta-analytic approach as opposed to indiscriminately pooling rescaled scores whenever cross-state or cross-grade average impact estimates are sought.

In the classical meta-analytic approach, separate effect estimates are produced for each grade and state and then combined using weighted average effect estimates or meta-regression models (including HLM meta-analytic models).29 In the weighted average estimate approach, each grade or state is treated as a separate substudy with a separate analysis and associated impact estimates. A second stage of analysis combines the grade-specific or state-specific estimates by averaging. Typically, the impact estimates from each grade and state are weighted by their associated sample size or by the inverse of their sampling variance (that is, the squared standard error of the estimate).
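The second-stage averaging might be sketched as follows. This sketch assumes the conventional inverse-variance weights, in which each estimate is weighted by one over its squared standard error; the function name and the effect sizes and standard errors are hypothetical.

```python
from math import sqrt

def pooled_effect(estimates, std_errors):
    """Inverse-variance weighted average of separate impact estimates.

    Returns the weighted average and its standard error, treating the
    estimates as independent.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, estimates)) / total
    return pooled, sqrt(1.0 / total)

# Hypothetical standardized impact estimates for three grade-by-state substudies.
estimates = [0.10, 0.20, 0.30]
std_errors = [0.10, 0.10, 0.20]
pooled, pooled_se = pooled_effect(estimates, std_errors)
print(f"pooled effect: {pooled:.3f} (SE {pooled_se:.3f})")
```

The third estimate has twice the standard error of the others and therefore receives one-quarter of their weight, so the pooled estimate sits closer to the two more precise estimates.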

Alternatively, a meta-regression model could be used to produce a combined effect size. These meta-regression models can be categorized into (a) those that use two stages of analysis in which grade- and state-specific impact estimates are produced in the first stage and then used as the dependent variable in the second stage; or (b) those that rely on a single statistical model in which student-level data are clustered by grade and state via random coefficients (for example, random slopes in HLM models) or fixed coefficients (for example, interactions in a regression model).

Method (a), the meta-regression using two stages, is useful only when there is a large number of impact estimates to be combined (at least 10 but preferably 30 or more) and moderator analyses are required (see footnote 24). Not surprisingly, although this method is often employed in meta-analytic and systematic reviews of a literature composed of many separate studies, it is unlikely to be appropriate for IES-funded RCTs given the relatively small number of grades and states in a single study.

Method (b), which employs a single regression model and uses fixed or random coefficients to distinguish impact estimates across grades or states, might be applicable to most multigrade or multistate RCTs. In this case, it is essential that the test scores from each grade and state be rescaled (as necessary) to a common scale (for example, grade-specific z-scores or adjusted z-scores based on statewide means and standard deviations) so that impact estimates from each grade and state are on the same scale.

In addition, whether to utilize fixed or random coefficients/effects in a meta-analysis is a key consideration. Fixed effects meta-analytic models impose the assumption that a single true impact applies to every grade and state, and that differences in estimated impacts across grades and states are due purely to sampling variation and are not indicative of systematic differences in impacts across states. This approach may employ interaction terms involving the grade, state, and treatment indicators to explicitly test whether treatment effects are different across grades and/or states. Alternatively, random effects meta-analytic models assume that treatment effects will vary across grades and states, usually with the specific assumption that the impact estimates are reflective of a sample of impact estimates drawn from the normally distributed population of impact estimates for the population of contexts. Random effects meta-analytic models also allow one to test and model variance in impact estimates using moderator variables (see footnote 24).

Another distinction between fixed and random effects models is that the former produces results that cannot be generalized beyond the sample of grades and states in the study, whereas the latter considers the grades and states in the study to be a sample from a larger population of grades and states to which the results might be generalized. In studies that include multiple grades and multiple states, it is possible to use fixed effects for grades (because generalization beyond the grades included in the study might not be a goal) and random effects for states (permitting generalization to states not included in the study).

It is also important to recognize that, although the assumptions underlying a random effects analysis may be plausible in some studies, a limitation of the random effects analysis is that it requires a relatively large number of grades or states to produce stable estimates. While a fixed effects analysis can be run with as few as two grades or states, maximum likelihood estimation of random effects models can become unstable when the number of states is small (for example, fewer than 10).30 Given the routinely small number of grades and states in IES-funded RCTs, it is likely that fixed effects methods are most applicable for typical study designs; however, it is important to recognize that use of fixed effects methods implies that the results will not be generalized beyond the sample of grades and states in the study.
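One way to examine whether separate impact estimates are consistent with a single true impact is sketched below. The techniques shown, Cochran's Q statistic and the DerSimonian-Laird moment estimator of between-context variance, are common meta-analytic tools that we introduce here for illustration; they are not named in this report, and the estimates and standard errors are hypothetical. With only a handful of grades or states, both quantities are imprecise and should be read with the caution noted above.

```python
def cochran_q(estimates, std_errors):
    """Cochran's Q: weighted squared deviations from the fixed effects average.

    Under homogeneity, Q is approximately chi-square with k - 1 degrees
    of freedom, where k is the number of estimates.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * d for w, d in zip(weights, estimates)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, estimates))
    return q, len(estimates) - 1

def dersimonian_laird_tau2(estimates, std_errors):
    """Moment estimate of the variance of true impacts across contexts."""
    q, df = cochran_q(estimates, std_errors)
    weights = [1.0 / se ** 2 for se in std_errors]
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    return max(0.0, (q - df) / c)  # truncated at zero

# Hypothetical case 1: estimates that differ by no more than sampling error.
q_similar, df_similar = cochran_q([0.10, 0.20, 0.30], [0.10, 0.10, 0.20])
tau2_similar = dersimonian_laird_tau2([0.10, 0.20, 0.30], [0.10, 0.10, 0.20])

# Hypothetical case 2: two precise estimates that clearly disagree.
q_differ, df_differ = cochran_q([0.0, 0.5], [0.1, 0.1])
tau2_differ = dersimonian_laird_tau2([0.0, 0.5], [0.1, 0.1])

print(q_similar, df_similar, tau2_similar)
print(q_differ, df_differ, tau2_differ)
```

When Q is no larger than its degrees of freedom, as in the first case, the variance estimate is truncated at zero and the random effects average coincides with the fixed effects average.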

Combining Effect Estimates Using Proficiency Scores. The analysis of proficiency category scores is even more complicated than the analysis of scaled scores in multistate or multigrade studies. This is because state assessments have wide variation in cut points and corresponding levels of difficulty for defining proficiency (NCES 2007). This suggests that the impact estimates for an intervention might depend on the difficulty of the state test. On the other hand, one might invoke the same argument posed at several points in this report—if the research questions focus on the impacts of the intervention on students' proficiency rates, then differences in the difficulty of the assessments could be ignored because definitions of proficiency are set by state policy and are indicative of the natural variation in what it means to be proficient. A less-extreme position would involve a "middle-ground" approach in which state-specific impact estimates are produced, and variation in impacts across states is modeled using random effects. In studies involving several states, this variation may be analyzed to determine how differences in the difficulty of the tests might moderate program impacts on proficiency rates.


23 Helpful references addressing issues of linking, rescaling, and equating different assessments include Linn (1993); Mislevy (1992); and Feuer, Holland, Green, Bertenthal, and Hemphill (1999).
24 Moderator analyses utilize regression methods to examine systematic variation in the size of treatment effects as it relates to variation in context or conditions across sites (Baron and Kenny 1986). It is important to note that moderator analyses are exploratory, however. They cannot differentiate between differences in effect sizes due to differences in the tests versus differences in program implementation across states and/or grades.
25 An additional reason researchers might choose to standardize by the state mean and standard deviation is that control group samples in each state may be too small to reliably estimate within-sample means and standard deviations (Hedges 1981).
26 State-level means and standard deviations on the NAEP in numerous subjects and grades since 1990 can be accessed through the online NAEP Data Explorer at
27 For an example in which these methods were used to link different assessments across grades, see May and Supovitz (2006).
28 In general, the choice of estimate should be justified such that it provides a reasonable approximation for the standard deviation of the population of students targeted by the intervention. The standard deviation for the target population might be estimated using data from a single grade and state or as a pooled estimate across grades and states. If tests with different score scales are used across multiple grades or states, then the standard deviation for each grade and state would first need to be expressed as the square-root of the ratio of sample variance to statewide variance. For example, a rescaled standard deviation of 0.5 denotes a sample variance (0.25) that is one-quarter of the statewide variance.
29 Note that a pooled multistate HLM analysis of rescaled scores should produce standardized effect sizes that are similar to weighted averages of standardized effect sizes calculated separately for each grade and/or state (Raudenbush and Bryk 2002). However, a key benefit of analyzing pooled individual-level data is the ability to produce more efficient estimates through multilevel analyses if the assumptions about the structure of the error term underlying the multistate HLM model hold (Littell, Milliken, Stroup, Wolfinger, and Schabenberger 2006). If the assumptions do not hold, results of this model will be biased. In that case, other models using robust standard errors via generalized estimating equations or Taylor-series estimation might be more appropriate (Liang and Zeger 1986; Goldstein 2003).
30 If a random effects model must be run with a small number of states, it might be possible to produce unbiased and consistent estimates using Bayesian estimation. The drawback of the Bayesian approach is that the analyst might be forced to make strong assumptions about the variance in the distribution of treatment effects when the number of grades or states in a study is small.