Technical Methods Report: Guidelines for Multiple Testing in Impact Evaluations

NCEE 2008-4018
May 2008

Appendix C: Weighting Options for Constructing Composite Domain Outcomes

The testing guidelines discussed in this report focus on significance tests for composite domain outcomes, which are combinations of individual domain outcomes. Thus, a critical issue is how to weight individual domain outcomes to construct composites.

There is a large literature, spanning many decades and multiple disciplines, on methods for combining multiple pieces of information into composites (see, for example, Kane and Case 2004, Wainer and Thissen 1993, Wang and Stanley 1970, Gulliksen 1950, Wilks 1938). As in the multiple comparisons literature, there is no consensus on an "optimal" method for forming composites that fits all circumstances. Rather, procedures should be selected that are best suited to the types of domain outcome measures under investigation and to the key research questions.

Composite formation rules should be specified in the study protocols. In developing rules, potential correlations among the domain outcomes should be considered. As discussed, outcome measures will likely be grouped into a domain if they are expected to measure a common latent construct. In situations where this objective is satisfied, different composite formation methods should yield similar composites (Landis et al. 2000). However, if the domain outcomes tap multiple factors, the choice of method could affect the resulting composites, in which case it may be appropriate to reconsider domain definitions. Thus, issues pertaining to the selection of weights for composite formation are similar to issues pertaining to the delineation of outcome domains.

This appendix briefly discusses composite formation methods from the literature that fit our context. A full literature review is beyond the scope of this introductory discussion.

For the ensuing discussion, it is assumed that a domain contains N outcome measures for each sample member, and that the vector Yi contains the values of outcome measure i. The outcomes are assumed to be standardized to have mean 0 and standard deviation 1 to avoid the composite being dominated by component outcomes with large variances, although the weighting schemes discussed below apply also to raw outcomes. A composite domain outcome, C, is defined as C = Σi wiYi, where the wi are "nominal" weights assigned to each outcome. The many interrelated weighting methods that have been used in education research involve selecting the wi to maximize various criterion functions, as discussed next.
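The standardization and weighting steps above can be sketched as follows (a minimal illustration on simulated data, not from any evaluation):

```python
import numpy as np

# Hypothetical illustration: form a composite C = sum_i w_i * Y_i from
# standardized outcomes. Y is an (n_subjects x N_outcomes) data matrix.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))          # 3 domain outcomes for 200 subjects

# Standardize each outcome to mean 0, standard deviation 1
Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)

w = np.array([1.0, 1.0, 1.0])          # nominal weights (unit weights here)
C = Z @ w                              # composite domain outcome, one per subject

print(C.shape)                         # (200,)
```

Because each standardized column has mean 0, any weighted sum of the columns also has mean 0; only the weights and the correlations among outcomes determine the composite's variance.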

Regression weights. Suppose a pertinent outside data source contains information on a well-established observable validity criterion as well as the outcome measures under investigation. These data could then be used to construct weights based on the relationship between the criterion measure and the outcomes. Examples of external criteria are measures of school readiness, longer-term test scores, high school graduation status, college attendance status, and earnings. Regression weights can be obtained by regressing the criterion measure (the dependent variable) on the component outcomes (the independent variables) using standard multivariate regression methods. The parameter estimates on the outcomes could then be used as weights for composite formation using the evaluation sample. The larger the correlation between an outcome and the criterion and the more independent an outcome is of other outcomes, the larger is the weight, all else equal. This approach yields weights that minimize the mean squared prediction error in the estimation sample.
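A sketch of regression weighting on simulated data (the criterion and outcomes here are synthetic stand-ins for an outside data source; the true coefficients are chosen arbitrarily for illustration):

```python
import numpy as np

# Simulated "outside" dataset: standardized outcomes plus a criterion
# measure (e.g., a longer-term test score) that the outcomes predict.
rng = np.random.default_rng(1)
n, N = 500, 3
Z = rng.normal(size=(n, N))                       # standardized outcomes
criterion = Z @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.5, size=n)

# Regress the criterion on the outcomes; the slope estimates serve as
# weights for composite formation in the evaluation sample.
X = np.column_stack([np.ones(n), Z])              # add an intercept
coef, *_ = np.linalg.lstsq(X, criterion, rcond=None)
w = coef[1:]                                      # drop the intercept

composite = Z @ w
print(np.round(w, 2))
```

These least-squares weights minimize the mean squared prediction error in the estimation sample, as noted above; applied to a different evaluation sample they need not remain optimal.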

The advantage of this method is that predictive validity is often recognized as a more important criterion than reliability in evaluating measurement procedures (Kane and Case 2004, Wang and Stanley 1970). However, this method can be used only if pertinent data are available to estimate regression weights that apply to the population under investigation and that are based on large samples to ensure the stability of results. Furthermore, even if these conditions are met, the results need to be interpreted carefully, because the estimation sample used to develop the weights may not necessarily yield optimal weights for the population from which the sample is drawn or for other samples drawn from the same population (Raju et al. 1997).

Finally, this approach is likely to be most useful when the outcomes (predictors) are relatively independent. This independence condition, however, will not typically be satisfied in our context. Thus, the regression approach may be more suited to forming composites for a between-domain analysis than a within-domain analysis.

Natural or unit weights. For this approach, the N outcomes are simply summed or averaged to form the composite—that is, wi is set to 1 (or a constant) for each outcome. This method is based on the "agnostic" criterion that each outcome is equally important. It has the advantage that it is easy to apply and understand. Bobko et al. (2007) show that this approach can be appropriate under many circumstances.

The use of unit weights does not necessarily imply, however, that each outcome contributes equally to the overall variance of the composite. The contribution of Yi to the variance of C is wiΣj wjρij, where ρij is the correlation between Yi and Yj (with ρii = 1). With unit weights, this contribution reduces to Σj ρij, so that the "effective" weight for each component outcome will depend on its average correlation with the other component outcomes. If average correlations are similar across outcomes, the effective and nominal unit weights will be similar.

A variant of this method is to select weights to ensure that each variable contributes equally to the total variance of the composite. This can be done by setting N equations of the form wiΣj wjρij equal to a constant and solving iteratively for the wi (Wilks 1938).
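The iterative solution can be sketched as a simple fixed-point scheme (an illustrative implementation, not necessarily Wilks's original algorithm; the correlation matrix is made up):

```python
import numpy as np

def equal_contribution_weights(R, tol=1e-10, max_iter=1000):
    """Find weights so each outcome's contribution w_i * sum_j w_j * rho_ij
    to the composite variance is the same (illustrative fixed-point scheme)."""
    w = np.ones(R.shape[0])
    for _ in range(max_iter):
        contrib = w * (R @ w)                 # contribution of each outcome
        w_new = w * np.sqrt(contrib.mean() / contrib)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

R = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
w = equal_contribution_weights(R)
contrib = w * (R @ w)
print(np.round(contrib, 6))                   # all entries (nearly) equal
```

At convergence the N contributions are equalized; outcomes that are less correlated with the others receive larger weights to compensate.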

Expert judgment or subjective weights. Another approach to developing composites is to employ a content-oriented strategy in which outcomes are assigned to composites based on existing theory or rational judgment. Under this approach, theory or expert guidance obtained prior to data analysis is used to determine the relative "importance" of each outcome to the underlying domain construct. This approach could also be used if some outcomes carry more "information" than others. For example, when combining tests, weights could be assigned based on the length of the tests or the nature of the questions (Wainer and Thissen 1993). For instance, larger weights could be assigned to multiple-choice questions than to true-false questions.

Maximum reliability weights. Another approach is to select weights to maximize the reliability of the composite. This approach is often discussed in the test theory literature for combining achievement test scores or items (Kane and Case 2004, Wainer and Thissen 1993). This approach has received attention because reliability of a measure is a necessary, although not sufficient, condition for validity of a measure.

Reliability is defined as the proportion of the total variance in the composite that is true-composite variance. Thus, maximum reliability weights can be found by maximizing the variance of the composite between subjects (between-subject variance) relative to the variance across outcomes within subjects (within-subject variance). Wang and Stanley (1970) and Gulliksen (1950) discuss procedures for obtaining these weights. Item response theory (IRT) is a more recent version of reliability weighting that simultaneously provides weights and a scale for the item responses (see, for example, Lord 1980).

Equal correlation weights. Another criterion function is to select weights that equalize the correlation between each outcome measure and the composite. The correlation between Yi and the composite C can be expressed as follows:

(1)   ρiC = (Σj wjρij)/σC, where σC is the standard deviation of the composite.

Because the denominator in (1) is the same for each outcome, equal correlation weights can be calculated by setting N equations of the form Σj wjρij equal to a constant and solving iteratively for the wi. This procedure is logically consistent only if all outcomes are positively correlated.
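Because the N conditions Σj wjρij = constant are linear in the weights, a direct linear solve gives the same answer as iterating; a sketch with a made-up correlation matrix:

```python
import numpy as np

# Equal correlation weights: require (R w)_i to be the same constant for
# every outcome i, which means w is proportional to R^{-1} * 1.
R = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
w = np.linalg.solve(R, np.ones(3))
w = w / w.sum() * 3            # arbitrary normalization (unit-weight scale)

print(np.round(R @ w, 6))      # all entries equal -> equal correlations with C
```

Each outcome then has the same covariance with the composite and, since the outcomes are standardized and σC is common, the same correlation with it.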

Factor analysis weights. Another approach for forming composites is to conduct a factor analysis on the component outcomes and to use the single factor solution as the composite outcome. If the data support a multi-factor solution, domain reconfigurations may be considered. Criteria for assessing the appropriate factor structure must be specified in the study protocols and adhered to in the analysis.

Alternatively, factor loadings from factor analyses conducted on other relevant datasets (perhaps with larger norming samples) could be applied as weights to form composites. Issues pertaining to the feasibility of this approach and the interpretation of results are similar to those discussed above for regression weighting.
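As a rough sketch of the first approach, the leading eigenvector of the sample correlation matrix (a principal component solution standing in for a formal single-factor analysis) can supply the weights; the data here are simulated around a single latent factor:

```python
import numpy as np

# Simulate 3 outcomes that load on one latent construct.
rng = np.random.default_rng(2)
latent = rng.normal(size=400)
Y = np.column_stack([0.8 * latent + rng.normal(scale=0.6, size=400)
                     for _ in range(3)])
Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)

# Loadings on the dominant factor = leading eigenvector of the
# correlation matrix (principal-component stand-in for factor analysis).
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues in ascending order
w = np.abs(eigvecs[:, -1])                # fix sign for interpretability
C = Z @ w

print(np.round(w, 2))
```

A dominant first eigenvalue (well above the others) is the kind of evidence a study protocol might require before treating the single-factor composite as appropriate.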

Multivariate analysis of variance (MANOVA) weights. MANOVA methods are commonly used to control the Type I error rate when examining treatment effects on multiple outcome measures (see, for example, Harris 1975). In our context, the MANOVA approach would involve conducting omnibus F-tests to address the research question: Did the intervention have a statistically significant effect on any outcome measure within the domain? This is a different question than the one addressed by the composite t-test approach: Did the intervention have a statistically significant effect on a common domain latent construct?

Because they address different research questions, these two approaches lead to different weighting schemes for combining the outcome measures. Under the MANOVA approach, weights are found to maximize the test statistics pertaining to impacts, whereas under the composite t-test approach, weights are found to best identify a common domain construct.

To demonstrate the implied weighting scheme for the MANOVA approach, assume that the N domain outcomes for each subject are sampled from a joint multivariate normal distribution with mean vector μT for mT treatments and μC for mC controls and common variance-covariance matrix Ω. Consider the composite impact estimate, CI:

CI = Σi wiIi = w′I,

where Ii is the impact estimate (mean treatment-control difference) for outcome i and I and w are Nx1 column vectors of impacts and weights, respectively. The squared t-statistic can then be written as follows:

(2)   t² = (w′I)² / [(1/mT + 1/mC)(w′Ω̂w)],

where Ω̂ is the usual estimator for Ω based on sample variances and covariances.

The omnibus F-statistic that is typically produced by statistical software packages can be obtained by first finding the weights w* that maximize (2) subject to the normalizing restriction w*′Ω̂w* = 1, and then inserting w* into (2). This procedure yields Hotelling's T² statistic:

T² = [mTmC/(mT + mC)] I′Ω̂⁻¹I,

which is just a multiple of the usual F-statistic.5
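The T² and F computations above can be sketched as follows (simulated treatment and control samples; the pooled sample covariance plays the role of Ω̂):

```python
import numpy as np

# Hotelling's T^2 for treatment-control mean differences on N outcomes,
# using simulated data (illustrative values, not from the report).
rng = np.random.default_rng(3)
mT, mC, N = 60, 60, 3
T_grp = rng.normal(loc=[0.3, 0.1, 0.0], size=(mT, N))   # treatment outcomes
C_grp = rng.normal(loc=0.0, size=(mC, N))               # control outcomes

I = T_grp.mean(axis=0) - C_grp.mean(axis=0)             # impact vector
# Pooled variance-covariance estimate (the Omega-hat of the text)
S = (((mT - 1) * np.cov(T_grp, rowvar=False)
      + (mC - 1) * np.cov(C_grp, rowvar=False)) / (mT + mC - 2))

T2 = (mT * mC / (mT + mC)) * I @ np.linalg.solve(S, I)
k = (mT + mC - N - 1) / (N * (mT + mC - 2))
F = k * T2                    # distributed F(N, mT + mC - N - 1) under H0
print(round(F, 3))
```

The conversion F = kT² matches the footnote below; the quadratic form I′Ω̂⁻¹I automatically upweights outcomes with larger standardized impacts.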

Thus, all else equal, MANOVA methods tend to place more weight on standardized outcomes with larger impacts (positive or negative) than on those with smaller impacts. Stated differently, this method tends to select weights that maximize the chance of obtaining significant impact findings.

The MANOVA approach is not recommended for the confirmatory analysis for several reasons. First, because domain outcomes are likely to tap the same underlying construct, it seems more appropriate to examine treatment effects on a composite measure of this construct than to test whether treatment effects exist for any of its components. A second reason is that it is difficult to develop a confirmatory theory that would result in outcomes being weighted according to the size of their impacts. Instead, the MANOVA procedure is a post hoc, data-driven method that is more suited to exploratory analyses.


5 Specifically, F = kT², where k = (mT + mC − N − 1)/[N(mT + mC − 2)]; F is distributed as F(N, mT + mC − N − 1).