Technical Methods Report: Guidelines for Multiple Testing in Impact Evaluations

NCEE 2008-4018
May 2008

Chapter Two: Guidelines for Multiple Testing

This section first discusses basic principles for addressing multiplicity, followed by a presentation of testing strategy guidelines. The focus is on designs with a single treatment and control group where data on multiple outcomes are collected for each sample member; these are the most common designs used in IES-funded education research. Guidelines are also provided for subgroup analyses and for designs with multiple treatment groups. The guidelines are consistent with those proposed for medical trials (see, for example, Lang and Secic 2007, CPMP 2002, and Altman et al. 2001), but are designed for evaluations of education interventions.

This report provides a structure for addressing the multiplicity problem and discusses issues to consider when formulating a testing strategy. The report does not provide step-by-step instructions for applying the guidelines; such instructions are not feasible given the myriad types of impact evaluations conducted in the education field. Specific details on the use of the guidelines will vary by study depending on the interventions being tested, target populations, key research questions, and study objectives.

Finally, the guidelines assume that a classical (frequentist) hypothesis testing approach is used to analyze the data, because this is the testing strategy that is typically used in impact evaluations in the education field. Appendix B discusses the basic features of this testing approach. (Appendix D discusses the alternative Bayesian approach.)

Basic Principles

  1. The multiple comparisons problem should not be ignored.

    The multiple comparisons problem can lead to erroneous study conclusions if the α level for individual tests is not adjusted downward. At the same time, strategies for dealing with multiplicity must strike a reasonable balance between testing rigor and statistical power—the chance of finding truly effective interventions.
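
    To see the scale of the problem, suppose k independent tests are each conducted at significance level α when no true effects exist. The probability of at least one false positive finding is 1 - (1 - α)^k; with α = .05 and 10 independent tests, this probability is 1 - (.95)^10 ≈ .40. Spurious "significant" findings are therefore quite likely even in an analysis of modest size (the figure is approximate because real outcomes are rarely fully independent).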

  2. Limiting the number of outcomes and subgroups forces a sharp focus and is one of the best ways to address the multiple comparisons problem.

    Multiple testing is less of a problem if studies limit the number of contrasts for analysis. Sharply focusing research questions on one or a few outcomes and on a small number of target groups diminishes the chance of finding impacts where none exist.

    At the same time, in some studies, theory and prior research may not support a sharp focus on outcomes or subgroups and, in others, the tested interventions may be expected to have a range of effects. Furthermore, in a context where IES and other funders are executing costly studies to identify promising approaches to difficult problems, narrowing the range of outcomes and subgroups limits researchers' ability to use post hoc exploratory analyses to find unexpected, yet policy-relevant information.

    Thus, the multiple comparisons testing strategy should be flexible to allow for (1) confirmatory analyses to assess how strongly the study's pre-specified central hypotheses are supported by the data, and (2) exploratory analyses to identify hypotheses that could be subject to future rigorous testing.

  3. The multiple comparisons problem should be addressed by first structuring the data. Furthermore, protocols for addressing the multiple comparisons problem should be established before data analysis is undertaken.

    The multiple comparisons testing strategy should be based on a process that first groups and prioritizes outcomes. The structuring of the data should be specified during the design stage of the study and published before study data are collected and analyzed. Multiple comparisons corrections should not be applied blindly to all outcomes, subgroups, and treatment alternatives considered together, because doing so would produce unnecessarily large reductions in the statistical power of the tests. Rather, the testing strategy should strike a reasonable balance between testing rigor and statistical power.

    Specific plans for structuring the data and addressing the multiple comparisons issue will depend on the study objectives. However, the testing strategy described next pertains broadly to impact evaluations that are typically conducted in the education field.

Guidelines for Developing a Strategy for Multiple Testing

  1. Delineate separate outcome domains in the study protocols.

    Outcome domains should be delineated using theory or a conceptual framework that relates the program or intervention to the outcomes. The domains should reflect key clusters of constructs represented by the central research questions of the study.

    The outcome domains, for example, could be defined by grouping outcomes that are deemed to have a common latent structure (such as test scores in particular subject areas, behavioral outcomes, or measures of classroom practices) or grouping outcomes with high correlations. Domains could also be defined for the same outcomes measured over time (for example, test scores collected at various follow-up points). Domains could pertain to specific population subgroups (for example, if the intervention is targeted primarily to students with particular characteristics, such as English language learners).

  2. Define confirmatory and exploratory analysis components prior to data analysis.

    The confirmatory analysis should provide estimates whose statistical properties can be stated precisely. The goal of this analysis is to present rigorous tests of the study's central hypotheses that are specified in the study protocols. The confirmatory analysis must address multiple comparison issues and must have sufficient statistical power to address the main research questions. This analysis could consist of two parts: (1) testing for impacts for each outcome domain separately, and (2) jointly testing for impacts across outcome domains. These analyses do not necessarily need to include all domains.

    The purpose of the exploratory analysis is to examine relationships within the data to identify outcomes or subgroups for which impacts may exist. The goal of the exploratory analysis is to identify hypotheses that could be subject to more rigorous future examination, but cannot be examined in the present study because they were not identified ahead of time or statistical power was deemed to be insufficient. Results from post hoc analyses are not automatically invalid, but, irrespective of plausibility or statistical significance, they should be regarded as preliminary and unreliable unless they can be rigorously tested and replicated in future studies.

  3. For domain-specific confirmatory analyses, conduct hypothesis testing for domain outcomes as a group.

    Outcomes will likely be grouped into a domain if they are expected to measure a common latent construct (even if the precise psychometric properties of the domain "items" are not always known in advance). Thus, conducting tests for domain outcomes as a group will measure intervention effects on this common construct. Combining outcomes that each tap the same latent construct could also yield test statistics with greater statistical power than if individual outcomes were examined one at a time.

    A composite t-test approach is recommended for testing global hypotheses about a domain. Under this approach, significance testing is performed on a single combination of domain outcomes. This procedure accounts for multiple comparisons by reducing the domain outcomes to a single composite measure. The approach addresses the question "Relative to the status quo, did the intervention have a statistically significant effect on a typical domain outcome or common domain latent factor?" Appendix C discusses possible options for defining weights to construct composite outcome measures, and an illustrative sketch follows this guideline.

    A statistically significant finding for a composite measure provides confirmatory evidence that the intervention had an effect on the common domain latent construct. When statistical significance of the composite has been established, a within-domain exploratory analysis could be conducted using unadjusted p-values to identify specific domain outcomes that contributed to the significant overall effect. The significance of a particular outcome does not provide confirmatory evidence about the domain as a whole, but provides information that could be used to help interpret the global findings.

    If the impact on the composite measure is not statistically significant, it is generally not appropriate to examine the statistical significance of the individual domain outcomes. However, if such analyses are performed, they must be qualified as exploratory.
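
    As a concrete illustration of the composite approach, the following sketch (in Python) standardizes each domain outcome, averages the standardized outcomes into a single equally weighted composite (one of the weighting options discussed in Appendix C), and conducts a single t-test on the composite. The variable names, simulated data, and equal-weighting choice are hypothetical assumptions for illustration, not prescriptions; in practice, impacts would typically be estimated from a regression model that includes baseline covariates.

        # Illustrative sketch: equally weighted composite for one outcome domain.
        # `outcomes` (n students x k domain outcomes) and `treat` (0/1 indicator)
        # are hypothetical stand-ins for study data.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        n, k = 400, 3
        treat = rng.integers(0, 2, size=n)
        outcomes = rng.normal(size=(n, k)) + 0.15 * treat[:, None]

        # Standardize each outcome (z-scores), then average with equal weights.
        z = (outcomes - outcomes.mean(axis=0)) / outcomes.std(axis=0, ddof=1)
        composite = z.mean(axis=1)

        # A single t-test per domain, so no within-domain adjustment is needed.
        t_stat, p_value = stats.ttest_ind(composite[treat == 1], composite[treat == 0])
        print(f"composite t = {t_stat:.2f}, p = {p_value:.4f}")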

  4. Use a similar testing strategy if the confirmatory analysis involves assessing intervention effects across domains.

    Providing confirmatory evidence about intervention effects for each domain separately may satisfy research objectives in some studies. However, if an intervention is to be judged based on its success in improving outcomes in one or more domains, the study may wish to obtain confirmatory summative evidence about intervention effects across domains. For instance, if test score and school attendance outcomes are delineated into separate domains, it may be of interest to rigorously test whether the intervention improved outcomes in either domain (or in both domains). In these cases, the study may wish to conduct hypothesis tests when the domains are considered together.

    The appropriate use of multiplicity adjustments for such analyses will depend on the main research questions for assessing overall intervention effects. For example, the research question of interest may be "Did the intervention have an effect on each domain?" In this case, multiple comparisons corrections are not needed; separate t-tests should be conducted at the α significance level for each domain composite outcome, and the null hypothesis of no treatment effect in at least one domain would be rejected only if every composite impact is statistically significant. This approach applies a very strict standard for confirming intervention effects.

    The research question could instead be "Did the intervention have an effect on any domain?" In this case, multiplicity corrections are warranted. Different domains will likely tap different underlying latent factors and dimensions of intervention effects. Thus, rather than conducting a t-test on an aggregate composite measure across domains (which could be difficult to interpret), hypothesis testing could be conducted for each domain composite individually using the recommended statistical adjustment procedures discussed in Appendix B. The null hypothesis of no treatment effect in each domain would then be rejected if any domain impact is statistically significant after applying the adjustment procedures.
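
    For the "any domain" question, the following sketch applies the Holm step-down procedure, one adjustment of the kind discussed in Appendix B, to p-values from per-domain composite tests. The domain names and p-values are hypothetical placeholders.

        # Illustrative sketch: Holm adjustment across domain composite tests.
        # The p-values below are hypothetical placeholders for per-domain results.
        from statsmodels.stats.multitest import multipletests

        domains = ["reading", "math", "attendance"]
        p_unadjusted = [0.012, 0.034, 0.210]

        reject, p_adjusted, _, _ = multipletests(p_unadjusted, alpha=0.05, method="holm")
        for name, p_adj, rej in zip(domains, p_adjusted, reject):
            print(f"{name}: adjusted p = {p_adj:.3f}, reject = {rej}")

        # The global null of no effect in any domain is rejected if any
        # adjusted p-value falls below alpha.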

  5. Multiplicity adjustments are not required for exploratory analyses.

    For exploratory analyses, one approach is to conduct unadjusted tests at the α level of significance. However, to minimize the chance of obtaining spurious significant findings, researchers may wish to apply multiplicity adjustments for some exploratory analyses if the data can be structured appropriately and statistical power levels are deemed acceptable.

    Study reports should explicitly state that exploratory analyses do not provide rigorous evidence of the intervention's overall effectiveness. Results from post hoc analyses should be reported as providing preliminary information on relationships in the data that could be subject to more rigorous future examination. These qualifications apply even if multiple comparisons correction procedures are used for exploratory analyses.

  6. Specify which subgroups will be part of the confirmatory analysis and which ones will be part of the exploratory analysis.

    If the study seeks to make rigorous claims about intervention effects for specific population subgroups, this should be specified in the study protocols and embedded in the confirmatory analysis strategy. To limit the multiple testing problem, only a limited number of educationally meaningful subgroups should be included in the analysis. The direction of the expected subgroup effects should be specified, and exact subgroup definitions should be fixed at the outset to avoid post hoc, data-dependent definitions. Furthermore, to ensure treatment-control group balance for each subgroup, efforts should be made to conduct random assignment within strata defined by the subgroups. The case for including subgroups in the confirmatory analysis is stronger if there is a priori reason to believe that treatment effects differ across the subgroups.

    The testing strategy for the subgroup analysis needs to address the simultaneous testing of multiple subgroups and outcomes and should link to the key study research questions. For example, suppose that hypotheses are postulated about intervention effects on a composite outcome across gender and age subgroups. Suppose also that the two key research questions are (1) "Is the intervention more effective for boys than girls?" and (2) "Is the intervention more effective for older than younger students?" In this case, it is appropriate to examine whether intervention effects differ by gender (and, separately, by age) by conducting F-tests on treatment-by-subgroup interaction terms that are included in the regression models (see the sketch at the end of this guideline). If the gender and age subgroups are to be considered together, a multiple comparisons adjustment procedure could be applied to the p-values from the various subgroup tests (see Appendix B).

    Hypothesis tests of differential subgroup effects are appropriate if subgroup results are to be used to target future program services to specific students. In this case, the standard of confirmatory evidence about the subgroup findings should be set high. It is generally accepted in the medical literature that tests of interactions are more appropriate for subgroup analyses than separate, subgroup-specific analyses of treatment effects, because declarations of statistical significance are often associated with decision making (Brookes et al. 2001, Rothwell 2005, Gelman and Stern 2006).

    There may be instances in education research, however, where the key research question is "Did the intervention have an effect on a composite outcome for a specific subgroup in isolation?" In this case, multiplicity adjustments are not warranted, because program effects are to be examined for a single subgroup that is specified in advance. Multiplicity adjustments, however, are necessary for hypothesis tests that address the following research questions: "Did the intervention have an effect for either younger or older students?" or "Did the intervention have an effect for any gender or age subgroup?" Results from these tests, however, must not be interpreted as providing information about differential effects across subgroups.

    Impact findings for subgroups that are not part of the confirmatory analysis should be treated as exploratory. Furthermore, post hoc subgroup analyses must be qualified as such in the study reports.
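
    To illustrate the interaction tests described in this guideline, the sketch below fits a regression with a treatment-by-gender interaction and tests the interaction term; the data frame, column names, and effect sizes are hypothetical. If gender and age interactions were both tested, the resulting p-values could then be passed through an adjustment procedure such as the Holm procedure sketched earlier.

        # Illustrative sketch: test of a treatment-by-subgroup interaction.
        # `df` is a hypothetical data frame with outcome `y`, a 0/1 treatment
        # indicator `treat`, and a 0/1 subgroup indicator `female`.
        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(0)
        n = 600
        df = pd.DataFrame({"treat": rng.integers(0, 2, size=n),
                           "female": rng.integers(0, 2, size=n)})
        df["y"] = (0.2 * df["treat"] + 0.1 * df["female"]
                   + 0.1 * df["treat"] * df["female"] + rng.normal(size=n))

        # The F-test on treat:female asks whether the intervention effect
        # differs between the two gender subgroups.
        model = smf.ols("y ~ treat * female", data=df).fit()
        print(model.f_test("treat:female = 0"))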

  7. Apply multiplicity adjustments in experimental designs with multiple treatment groups.

    A rigorous standard of evidence should be applied in designs with multiple treatment groups before concluding that some (for example, specific reading curricula) are preferred over others. The confirmatory testing strategy for these designs must be specified prior to data analysis. The strategy could include global tests of differences across treatments, or tests of differences between specific treatment pairs that could be used to rank treatments. The strategy should also address simultaneous testing of multiple treatments, outcomes, and subgroups (if pertinent). As discussed in Appendix B, multiplicity adjustment procedures have been developed for situations where multiple treatments are compared to each other or to a common control group.
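
    As one illustration, the sketch below compares two hypothetical treatment groups (for example, two reading curricula) to a common control group using Dunnett's procedure, which adjusts for the multiple treatment-control contrasts. The data are simulated placeholders, and scipy.stats.dunnett requires SciPy 1.11 or later.

        # Illustrative sketch: multiple treatments versus a common control.
        # Dunnett's procedure adjusts for the multiple treatment-control
        # comparisons; the outcome data below are hypothetical.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        control = rng.normal(0.00, 1.0, size=200)
        curriculum_a = rng.normal(0.25, 1.0, size=200)
        curriculum_b = rng.normal(0.05, 1.0, size=200)

        result = stats.dunnett(curriculum_a, curriculum_b, control=control)
        print(result.pvalue)  # one adjusted p-value per treatment-control contrast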

  8. Design the evaluation to have sufficient statistical power for examining intervention effects for all prespecified confirmatory analyses.

    Statistical power calculations for the confirmatory analysis must account for multiplicity. The determination of appropriate evaluation sample sizes will depend on the nature of the confirmatory analysis. For example, for domain-specific confirmatory analyses, the study should have sufficient power to detect impacts for composite domain outcomes. Similarly, if subgroup analyses are part of the confirmatory testing strategy, the power analysis should account for the simultaneous testing of multiple subgroups and multiple outcomes, and similarly for designs with multiple treatment groups. Brookes et al. (2001) show that if a study has 80 percent power to detect the overall treatment effect, the sample needs to be at least four times larger to detect a subgroup-by-treatment interaction effect of the same magnitude.
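
    The sketch below shows one way to fold a multiplicity adjustment into a power calculation: the per-test significance level is set to a Bonferroni-adjusted α/m before solving for the required sample size. The effect size, number of confirmatory tests, and power target are hypothetical inputs; note that shrinking α raises the required sample size relative to a single unadjusted test, consistent with the power tradeoff noted above.

        # Illustrative sketch: sample size per group for a two-group comparison,
        # using a Bonferroni-adjusted alpha for m confirmatory domain tests.
        # The effect size, m, and power target are hypothetical inputs.
        from statsmodels.stats.power import TTestIndPower

        effect_size = 0.20          # standardized mean difference
        m = 3                       # number of confirmatory composite tests
        alpha_per_test = 0.05 / m

        n_per_group = TTestIndPower().solve_power(
            effect_size=effect_size, alpha=alpha_per_test, power=0.80)
        print(f"required n per group: {n_per_group:.0f}")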

  9. Qualify confirmatory and exploratory analysis findings in the study reports.

    No single approach to presenting p-values from multiple comparisons tests will fit the needs of all evaluation reports. In some instances, it may be preferable to present adjusted p-values in appendices and unadjusted p-values in the main text, whereas in other instances, it may be preferable to present adjusted p-values in the main text or in footnotes. The reporting of adjusted or unadjusted confidence intervals could also be desirable.

    Some users of study reports may have a narrow interest in the effectiveness of the intervention for a specific outcome, subgroup, or treatment alternative. Where interest focuses on a specific contrast in isolation, the usual t-test conducted at significance level α is the appropriate test (and unadjusted p-values should be examined to assess statistical significance). This does not necessarily mean, however, that unadjusted p-values should be reported for all analyses to accommodate readers with myriad interests. Rather, study results should be presented in a way that best addresses the key research questions specified in the study protocols.

    It is essential that results from the confirmatory and exploratory analyses be interpreted and qualified appropriately and that the presentation of results be consistent with study protocols. Confirmatory analysis findings should be highlighted and emphasized in the executive summary of study reports.
