Studies that examine the impacts of education interventions on key student, teacher, and school outcomes typically collect data on large samples and on many outcomes. In analyzing these data, researchers typically conduct multiple hypothesis tests to address key impact evaluation questions. Tests are conducted to assess intervention effects for multiple outcomes, for multiple subgroups of schools or individuals, and sometimes across multiple treatment alternatives.
In such instances, separate t-tests for each contrast are often performed to test the null hypothesis of no impacts, where the Type I error rate (statistical significance level) is typically set at α = 5 percent for each test. This means that, for each test, the chance of erroneously finding a statistically significant impact is 5 percent. However, when the hypothesis tests are considered together, the "combined" Type I error rate could be considerably larger than 5 percent. For example, if all null hypotheses are true, the chance of finding at least one spurious impact is 23 percent if 5 independent tests are conducted, 64 percent for 20 tests, and 92 percent for 50 tests (as discussed in more detail later in this report). Thus, without accounting for the multiple comparisons being conducted, users of the study findings may draw unwarranted conclusions.
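The combined error rates quoted above follow from the standard formula for the chance of at least one false rejection across m independent tests, 1 - (1 - α)^m. A quick illustrative check (not the report's own computation):

```python
# Family-wise Type I error rate when m independent tests are each run at
# significance level alpha and all null hypotheses are true:
# P(at least one false rejection) = 1 - (1 - alpha)^m.
alpha = 0.05
for m in (5, 20, 50):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests: {fwer:.0%}")
# prints roughly 23%, 64%, and 92%, matching the figures in the text
```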
At the same time, statistical procedures that correct for multiple testing typically result in hypothesis tests with reduced statistical power—the probability of rejecting the null hypothesis given that it is false. Stated differently, these adjustment methods reduce the likelihood of identifying real differences between the contrasted groups. This is because controlling for multiple testing involves lowering the Type I error rate for individual tests, with a resulting increase in the Type II error rate. Simulation results presented later in this report show that if statistical power for an uncorrected individual test is 80 percent, the commonly used Bonferroni adjustment procedure reduces statistical power to 59 percent if 5 tests are conducted, 41 percent for 20 tests, and 31 percent for 50 tests. Thus, multiplicity adjustment procedures can lead to substantial losses in statistical power.
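The power figures above can be reproduced approximately without simulation under a simple assumption: a two-sided z-test whose unadjusted version (α = 0.05) has 80 percent power, which pins down the standardized effect size. The Bonferroni procedure then tests each of m hypotheses at level α/m. This sketch is illustrative and is not the report's simulation code:

```python
from statistics import NormalDist

# Approximate power of a two-sided z-test after a Bonferroni correction.
# Assumption (for illustration): the unadjusted test at alpha = 0.05 has
# 80 percent power, which determines the standardized effect size.
N = NormalDist()
alpha, power = 0.05, 0.80
effect = N.inv_cdf(1 - alpha / 2) + N.inv_cdf(power)  # about 2.80

for m in (1, 5, 20, 50):
    crit = N.inv_cdf(1 - (alpha / m) / 2)  # Bonferroni critical value
    print(f"{m:2d} tests: power = {N.cdf(effect - crit):.0%}")
# prints roughly 80%, 59%, 41%, and 31%, matching the figures in the text
```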
There is disagreement about the use of multiple testing procedures and the appropriate tradeoff between Type I error and statistical power (Type II error). Saville (1990) argues against multiplicity control on the grounds that it sacrifices statistical power, contending that common sense and information from other sources should instead be used to protect against errors of interpretation. Cook and Farewell (1996) argue that multiplicity adjustments may not be necessary if there is a priori interest in estimating separate (marginal) treatment effects for a limited number of key contrasts that pertain to different aspects of the intervention. Some authors also contend that the use of multiplicity corrections may be somewhat ad hoc, because the choice of the size and composition of the family of tests could be "manipulated" to find statistical significance (or insignificance). Many other authors argue, however, that ignoring multiplicity can lead to serious misinterpretation of study findings and to publishing bias (see, for example, Westfall et al. 1999). These authors also argue that the choice of the tested families should be made prior to the data analysis to avoid the manipulation of findings.
Multiple comparisons issues are often not addressed in impact evaluations of education interventions, or in studies in other fields. For example, in a survey of physiology journals, Curran-Everett (2000) found that only 40 percent of articles reporting results from clinical trials addressed the multiple comparisons problem. Hsu (1996) also reports that multiple comparisons adjustment procedures are often used incorrectly.
Accordingly, the Institute of Education Sciences (IES) at the U.S. Department of Education (ED) contracted with Mathematica Policy Research, Inc. (MPR) to develop guidelines for appropriately handling multiple testing in education research. These guidelines—which are presented in this report—were developed with substantial input from an advisory panel (Appendix A lists the panel members). The views expressed in this report, however, are those of the author.
The remainder of this report presents the guidelines for multiple testing, followed by several technical appendixes to help researchers apply the guidelines. Appendix B provides more details on the nature of the multiple testing problem and the statistical solutions that have been proposed in the literature. Appendix C discusses the creation of composite outcome measures, which is a central feature of the recommended procedures. Finally, Appendix D presents the Bayesian hypothesis testing approach, which is the main alternative to the classical hypothesis testing framework that is assumed for this report.