Technical Methods Report: Guidelines for Multiple Testing in Impact Evaluations

NCEE 2008-4018
May 2008

Appendix B: Introduction to Multiple Testing

This appendix introduces the hypothesis testing framework for this report, the multiple testing problem, statistical methods to adjust for multiplicity, and some concerns that have been raised about these solutions. The goal is to provide an intuitive, nontechnical discussion of key issues related to this complex topic to help education researchers apply the guidelines presented in the report. A comprehensive review of the extensive literature in this area is beyond the scope of this introductory discussion. The focus is on continuous outcomes, but appropriate procedures are highlighted for other types of outcomes (such as binary outcomes). The appendix concludes with recommended methods.1

The Hypothesis Testing Framework

In this report, it has been assumed that a classical (frequentist) hypothesis testing approach is used to analyze the data; this is the testing strategy that is typically used in IES evaluations. This section highlights key features of this approach. Appendix D summarizes key features of the alternative Bayesian testing approach.

To describe the classical approach, it is assumed that treatment and control groups are randomly selected from a known population and that data on multiple outcomes are collected on each sample member. For contrast j, let μTj and μCj be population means for the treatment and control groups (or two treatment groups), respectively, and let δj = μTj - μCj be the population average treatment effect (impact). In the classical framework, population means and, hence, population impacts are assumed to be fixed.

Statistical analysis under this approach usually centers on a significance test—such as a two-tailed t-test—of a null hypothesis H0j: δj = 0 versus the alternative hypothesis H1j: δj ≠ 0.2 The Type I error rate—the probability of rejecting H0j given that it is true—is typically set at α = 5 percent for each test. Evaluation sample sizes are typically determined so that statistical power—the probability of rejecting H0j given that it is false—is 80 percent if the true impact is equal to a value that is deemed to be educationally meaningful or realistically attainable by the intervention.

Under this framework, a null hypothesis is typically rejected (that is, an impact is declared statistically significant) if the p-value for the statistical test is less than 5 percent, or equivalently, if the 95 percent confidence interval for the contrast does not contain zero. This is a frequentist approach because, in hypothetical repeated random assignments of population subjects to the treatment and control groups, 95 percent of the constructed confidence intervals would contain the true, fixed population impact. Probabilistic statements can be made about the random confidence interval but not about the fixed impact.

What Is the Multiple Testing Problem?

Researchers typically perform many simultaneous hypothesis tests when analyzing experimental data. Multiple tests are conducted to assess intervention effects (treatment-control differences) across multiple outcomes (endpoints). In some evaluations, multiple tests are also conducted to assess differences in intervention effects across multiple treatment groups (such as those defined by various reading or math curricula) or population subgroups (such as student subgroups defined by age, gender, race/ethnicity, or baseline risk factors).

In such instances, separate t-tests for each contrast are often performed to test the null hypothesis of no impacts, where the Type I error rate is typically set at α = 5 percent for each test. Thus, for each test, the chance of erroneously finding a statistically significant impact is 5 percent. However, when the "family" of hypothesis tests is considered together, the "combined" Type I error rate could be considerably larger than 5 percent. This is the heart of the multiple testing problem.

For example, suppose that the null hypothesis is true for each test and that the tests are independent. Then, the chance of finding at least one spurious impact is 1 - (1 - α)^N, where N is the number of tests. Thus, the probability of making at least one Type I error is 23 percent if 5 tests are conducted, 64 percent for 20 tests, and 92 percent for 50 tests (Table B.1).
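This calculation is simple to verify directly. The short Python snippet below is purely illustrative (the variable names are not taken from the report) and reproduces the probabilities cited for 5, 20, and 50 independent tests.

    # Probability of at least one Type I error across N independent tests,
    # each conducted at significance level alpha, when all null hypotheses are true.
    alpha = 0.05
    for n_tests in (5, 20, 50):
        p_any_false_positive = 1 - (1 - alpha) ** n_tests
        print(n_tests, round(p_any_false_positive, 2))   # prints 0.23, 0.64, 0.92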

The definition of the combined Type I error rate has implications for the strategies used to adjust for multiple testing and for interpreting the impact findings. The next section discusses the most common definitions found in the literature and provides numerical examples.

Definitions of the Combined Type I Error Rate
The two most common definitions of the combined Type I error rate found in the literature are the (1) familywise error rate (FWER) and (2) false discovery rate (FDR):

  • The FWER, defined by Tukey (1953), has traditionally been the focus of research in this area. The FWER is the probability that at least one null hypothesis will be rejected when all null hypotheses are true. For example, when testing treatment-control differences across multiple outcomes, the FWER is the likelihood that at least one impact will be found to be significant when, in fact, the intervention had no effect on any outcome. As discussed, the FWER is 1 - (1 - α)^N for independent tests, where N is the number of tests (dependent tests are discussed below).
  • The FDR, defined by Benjamini and Hochberg (1995), is a more recent approach for assessing how errors in multiple testing could be considered. The FDR is the expected proportion of all rejected null hypotheses that are rejected erroneously. Stated differently, the FDR is the expected fraction of significant test statistics that are false discoveries.

Table B.2 helps clarify these two error rates. Suppose that multiple tests are conducted to assess intervention effects on N study outcomes and that M null hypotheses are true (M is unobservable). Suppose further that based on t-tests, Q null hypotheses are rejected and that A, B, C, and D signify cell counts when t-test results are compared to the truth. The counts Q and A to D are random variables.

In Table B.2, the FWER is the probability that the random variable B is at least 1 among the M null hypotheses that are true. The FDR equals the expected value of B/Q, where B/Q is defined to equal 0 if Q = 0.3 If all null hypotheses are true, then B = Q and the FDR and FWER are equivalent; otherwise, the FDR is less than or equal to the FWER.

The two error rates have a different philosophical basis. The FWER measures the likelihood of a single erroneous rejection of the null hypothesis across the family of tests. Researchers who focus on the FWER are concerned with mistakenly reporting any statistically significant findings. The concerns are that unwarranted scientific conclusions about evaluation findings could be made as a result of even one mistake and that researchers may select erroneous significant findings for emphasis when reporting and publishing results.

The rationale behind the FDR is that a few erroneous rejections may be less problematic for drawing conclusions about the family of tests when many null hypotheses are rejected than when only a few null hypotheses are rejected. The rejection of many null hypotheses is a signal that there are real differences across the contrasted groups. Thus, researchers might be willing to tolerate more false positives (that is, larger values for B) if realized values for Q were large than if they were small. Under this approach, conclusions regarding intervention effects are to be based on the preponderance of evidence; the set of discoveries is to be used to reach an overall decision about the treatment. For those who adopt this approach, controlling the FWER is too conservative because, if many significant effects are found, a few additional errors will not change the overall validity of study findings.

Finally, variants of the FWER and FDR have been proposed in the literature. For example, Gordon et al. (2007) discuss a variant of the FWER—the per family error rate (PFER)—which is the expected number of Type I errors that are made (that is, the expected value of B in Table B.2). For instance, the PFER equals 5 if, among all tests with true null hypotheses, 5 are expected to be found statistically significant. Gordon et al. (2007) argue that the use of the PFER (which focuses on expectations) rather than the FWER (which focuses on probabilities) could yield tests with more statistical power while maintaining stringent standards for Type I error rates across the family of tests. Storey (2002) introduced a variant of the FDR—the positive false discovery rate (pFDR)—which is the expected value of B/Q given that Q is positive. He argues that this measure may be more appropriate in some instances.

Quantifying the FWER and FDR
To demonstrate the relationship between the FDR and the FWER, simulated data were generated on N mean outcomes for samples of 1,000 treatment and 1,000 control group members (Table B.3). Data for mean outcome j were obtained as random draws from a normal distribution with standard deviation 1 and with mean μCj for controls and μTj for treatments. The impacts (μTj - μCj) were set to 0 for M outcomes with true null hypotheses and to 0.125 for (N - M) outcomes with false null hypotheses; the 0.125 value was set so that the statistical power of the tests was 80 percent. Each individual hypothesis was tested using a two-tailed t-test at the 5 percent significance level and the test statistics were generated independently. Each simulation involved 10,000 repetitions. Simulations were conducted for N = 5, 10, 20, and 50 and for M/N = 100, 80, 50, and 20 percent.
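The following Python sketch outlines how such a simulation could be coded. It is a scaled-down illustration of the setup just described (fewer repetitions, and the function name and defaults are illustrative choices), not the program used to produce Table B.3.

    import numpy as np
    from scipy import stats

    def simulate_fwer_fdr(n_tests, n_true_nulls, n_per_group=1000, effect=0.125,
                          alpha=0.05, reps=1000, seed=0):
        # Monte Carlo estimates of the FWER and FDR for independent two-tailed
        # t-tests; outcomes are normal with standard deviation 1.
        rng = np.random.default_rng(seed)
        impacts = np.zeros(n_tests)
        impacts[n_true_nulls:] = effect            # false nulls get a true impact
        any_false_rejection = 0
        false_discovery_shares = []
        for _ in range(reps):
            rejected = np.zeros(n_tests, dtype=bool)
            for j in range(n_tests):
                control = rng.normal(0.0, 1.0, n_per_group)
                treatment = rng.normal(impacts[j], 1.0, n_per_group)
                rejected[j] = stats.ttest_ind(treatment, control).pvalue < alpha
            b = rejected[:n_true_nulls].sum()      # erroneous rejections (B)
            q = rejected.sum()                     # total rejections (Q)
            any_false_rejection += int(b >= 1)
            false_discovery_shares.append(b / q if q > 0 else 0.0)
        return any_false_rejection / reps, float(np.mean(false_discovery_shares))

    fwer, fdr = simulate_fwer_fdr(n_tests=10, n_true_nulls=8)
    print(fwer, fdr)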

The main results in Table B.3 are as follows:

  • The FWER increases substantially with the number of tests. If all null hypotheses are true, the FWER is 23 percent for 5 independent tests, 64 percent for 20 independent tests, and 92 percent for 50 independent tests.
  • The FWER and FDR are equivalent if all null hypotheses are true; otherwise, the FDR is less than the FWER. Thus, procedures that control the FWER also control the FDR, but the reverse does not necessarily hold. Differences between the FDR and FWER become larger as (1) the number of tests increases and (2) the number of true null hypotheses decreases (that is, when many differences across the contrasted groups truly exist).

These results suggest that the FDR is a less conservative measure than the FWER, especially if a considerable fraction of all null hypotheses are false. Thus, as demonstrated below, methods that control the FDR could yield tests with greater statistical power than those that control the FWER. The choice of which error criterion to control is important and must be made prior to the data analysis.

What Are Statistical Solutions to the Multiple Testing Problem?

A large body of literature describes statistical methods to adjust Type I errors for multiple testing (see, for example, the books by Westfall et al. 1999, Hsu 1996, and Westfall and Young 1993). The literature suggests that there is not one method that is preferred in all instances. Rather, the appropriate measure will depend on the study design, the primary research questions that are to be addressed, and the strength of inferences that are required.

This section briefly summarizes the literature in this area. Methods that control the FWER are discussed first, and methods that control the FDR are discussed second. Statistical packages (such as SAS) can be used to apply many of these methods.

Methods for FWER Control
Until recently, most of the literature on multiple testing focused on methods to control the FWER at a given α level (that is, methods to ensure that the FWER ≤ α). The most well-known method is the Bonferroni procedure, which sets the significance level for individual tests at α/N, where N is the number of tests.

The Bonferroni procedure controls the FWER when all null hypotheses are true or when some are true and some are false (that is, it provides "strong" control of the FWER). This feature differs from another well-known procedure—Fisher's protected least significant difference (LSD)—where an overall F-test across the tests is first conducted at the α level and further comparisons about individual contrasts are conducted at the α level only if the F-test is significant (Fisher 1935). Fisher's LSD controls the FWER only when all null hypotheses are true and, thus, provides "weak" control of the FWER. This means that Fisher's LSD may not control the FWER for second-stage individual hypotheses.4 The same issue applies to other multiple-stage tests, such as the Newman-Keuls (Newman 1939, Keuls 1952) and Duncan (1955) methods.

The Bonferroni method applies to both continuous and discrete data, controls the FWER when the tests are correlated, and provides adjusted confidence bounds (by using α/N rather than α in the calculations). Furthermore, it is flexible because it controls the FWER for tests of joint hypotheses about any subset of N separate hypotheses (including individual contrasts). The procedure will reject a joint hypothesis H0 if any p-value for the individual hypotheses included in H0 is less than α/N. The Bonferroni method, however, yields conservative bounds on Type I error and, hence, has low power.
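As a concrete illustration of the rule, a minimal Python sketch of the Bonferroni adjustment follows; the p-values shown are hypothetical.

    # Bonferroni: test each of the N individual hypotheses at level alpha / N.
    def bonferroni_reject(p_values, alpha=0.05):
        n = len(p_values)
        return [p < alpha / n for p in p_values]

    p_values = [0.003, 0.020, 0.045, 0.310]      # hypothetical p-values
    print(bonferroni_reject(p_values))           # only 0.003 falls below 0.05/4 = 0.0125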

Many modified and sometimes more powerful versions of the Bonferroni method have been developed that provide strong control of the FWER. We provide several examples:

  • Šidák (1967) developed a slightly less conservative bound where the significance level for individual tests is set at 1 - (1 - α)^(1/N) rather than α/N. This method has properties similar to those of the Bonferroni method and is slightly more powerful, although it does not control the FWER in all situations in which test statistics are dependent.
  • Scheffé (1959) developed an alternative procedure where two means are declared significantly different if |t| ≥ √((N - 1)F(α; N - 1, ν)), where t is the t-statistic and F(.) is the α-level critical value of the F distribution with (N - 1) numerator and ν denominator degrees of freedom. This procedure has the nice property that if the F-test for the global hypothesis is insignificant, then the Scheffé method will never find any mean difference to be significant. The procedure also applies to all linear combinations of contrasts. It tends to be more powerful than the Bonferroni method if the number of tested contrasts is large (more than 20), but tends to be less powerful than the Bonferroni method for fewer tests.
  • Holm (1979) developed a sequential "step-down" method: (1) order the p-values from the individual tests from smallest to largest, p(1) ≤ p(2) ≤ ... ≤ p(N), and order the corresponding null hypotheses H0(1), H0(2),...,H0(N); (2) define k as the minimum j such that p(j) > α/(N - j + 1); and (3) reject all H0(j) for j = 1,…,(k - 1) (if no such k exists, reject all null hypotheses). This procedure is more powerful than the Bonferroni method because the bound for this method sequentially increases whereas the Bonferroni bound remains fixed. The Holm method controls the FWER in the strong sense, but cannot be used to obtain confidence intervals. (A sketch of this step-down rule, together with Hochberg's step-up variant, appears after this list.)
  • Hochberg (1988) developed a "step-up" procedure that involves sequential testing where p-values are ordered from largest to smallest (rather than vice versa as for the Holm test). The method first defines k as the maximum j such that p(j) ≤ α/(N - j + 1), and then rejects all H0(j) for j = 1,...,k. This procedure is more powerful than the Holm method, but the control of the FWER is not guaranteed for all situations in which the test statistics are dependent (although simulation studies have shown that it is conservative under many dependency structures).
  • Rom (1990) derived a step-up procedure similar to Hochberg's procedure that uses different cutoffs and has slightly more power because it exactly controls the FWER at α for independent test statistics.
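The sketch below illustrates the Holm step-down and Hochberg step-up rules described above; the p-values are hypothetical and the function names are illustrative.

    import numpy as np

    def holm_reject(p_values, alpha=0.05):
        # Step-down: move from the smallest p-value upward and stop at the first
        # ordered p-value that exceeds alpha / (N - j + 1).
        p = np.asarray(p_values)
        order = np.argsort(p)
        n = len(p)
        reject = np.zeros(n, dtype=bool)
        for j, idx in enumerate(order, start=1):
            if p[idx] > alpha / (n - j + 1):
                break
            reject[idx] = True
        return reject

    def hochberg_reject(p_values, alpha=0.05):
        # Step-up: find the largest ordered index k with p_(k) <= alpha / (N - k + 1)
        # and reject the k hypotheses with the smallest p-values.
        p = np.asarray(p_values)
        order = np.argsort(p)
        n = len(p)
        reject = np.zeros(n, dtype=bool)
        for k in range(n, 0, -1):
            if p[order[k - 1]] <= alpha / (n - k + 1):
                reject[order[:k]] = True
                break
        return reject

    p_values = [0.010, 0.015, 0.030, 0.040]      # hypothetical p-values
    print(holm_reject(p_values))                 # rejects the two smallest
    print(hochberg_reject(p_values))             # rejects all four

In this hypothetical example the two rules disagree, which illustrates the extra power of the step-up approach: Hochberg rejects all four hypotheses because the largest p-value (0.040) falls below α, whereas Holm stops once an ordered p-value exceeds its threshold.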

Bootstrap and permutation resampling methods are alternative, computer-intensive methods that provide strong control of the FWER (see, for example, Westfall and Young 1993 and Westfall et al. 1990). These methods incorporate distributional and correlational structures across tests, so they tend to be less conservative than the other general-purpose methods and, hence, may have more power. Furthermore, they are applicable in many testing situations. These methods can be applied as follows:

  • Generate a large number of pseudo data sets by selecting observations with replacement (for the bootstrap methods) or without replacement (for the permutation methods). The sampling should be performed without regard to treatment status. Instead, sampling is performed for the combined research groups, and sampled observations are randomly ordered and proportionately split into pseudo-research groups.
  • For each iteration, calculate pseudo-p-values for each t-test and store the minimum pseudo-p-value across tests.
  • The adjusted p-value for an individual contrast is the proportion of iterations where the minimum pseudo-p-value is less than or equal to the actual p-value for that contrast.
  • Significance testing is based on the adjusted p-values.

The intuition behind this procedure is that the distribution of the maximum t-statistic (minimum p-value) provides simultaneous confidence intervals that apply to all tests under the null hypothesis. The resampling methods use the data to estimate this distribution, which yields the multiplicity-adjusted p-values. In essence, a hypothesis test is rejected if the actual t-statistic value for that test is in the tail of the maximum t-statistic distribution.
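A simplified sketch of this bootstrap min-p adjustment follows. It assumes the data are stored as subject-by-outcome arrays, uses illustrative function and variable names, and omits design features (covariates, clustering, weights) that a full implementation in the spirit of Westfall and Young (1993) would need to handle.

    import numpy as np
    from scipy import stats

    def minp_adjusted_pvalues(treatment, control, n_boot=2000, seed=0):
        # treatment and control: arrays of shape (subjects, outcomes).
        rng = np.random.default_rng(seed)
        combined = np.vstack([treatment, control])   # pool groups, ignoring treatment status
        n_t = treatment.shape[0]
        n_total, n_outcomes = combined.shape

        # Observed p-values, one two-tailed t-test per outcome.
        observed_p = np.array([
            stats.ttest_ind(treatment[:, j], control[:, j]).pvalue
            for j in range(n_outcomes)
        ])

        # Resample pseudo data sets and store the minimum pseudo-p-value per iteration.
        min_pseudo_p = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n_total, n_total)   # resample with replacement (bootstrap);
            pseudo = combined[idx]                    # for the permutation variant, use
            pseudo_t = pseudo[:n_t]                   # idx = rng.permutation(n_total) instead
            pseudo_c = pseudo[n_t:]
            min_pseudo_p[b] = min(
                stats.ttest_ind(pseudo_t[:, j], pseudo_c[:, j]).pvalue
                for j in range(n_outcomes)
            )

        # Adjusted p-value: share of iterations whose minimum pseudo-p-value is at or
        # below the observed p-value for that outcome.
        return np.array([(min_pseudo_p <= p).mean() for p in observed_p])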

Alternative methods to control the FWER have been developed when the design contains several treatment and control groups. The Tukey-Kramer (Tukey 1953, Kramer 1956) method is applicable if all pairwise comparisons of treatment and control means are of primary interest. If comparisons with a single control group are of primary interest, the Dunnett (1955) method is appropriate. These methods account for the dependence across test statistics due to the repetition of samples across contrasts.

The use of planned orthogonal contrasts is another method that adjusts for dependency when T treatment and control groups are compared to each other (see, for example, Bechhofer and Dunnett 1982). To describe this procedure, let Yi be a mean outcome (composite) for research group i and let Cj = Σi cjiYi, where the cji are constants such that Σi cji = 0 (j = 1,...,(T - 1)). The Cjs represent a family of (T - 1) contrasts (linear combinations) of the Yis. Mutually orthogonal contrasts arise if sample sizes are the same in each treatment condition and Σi cjicki = 0 for all j ≠ k. A property of orthogonal contrasts is that the total sum of squares across the T research groups can be partitioned into (T - 1) sums of squares, one for each orthogonal contrast.

Significance testing can be performed for each orthogonal contrast using the multiple comparisons adjustment procedures discussed above. An advantage of this method is that testing problems associated with dependent test statistics disappear. Furthermore, if T is large, the use of orthogonal contrasts requires fewer test statistics than the Tukey-Kramer procedure, thereby reducing the multiple comparisons problem.

The use of planned orthogonal contrasts may be desirable if they correspond to key research questions and can be easily interpreted. However, this approach may not be appropriate if the key contrasts of interest are not mutually orthogonal.
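To make the construction concrete, the short sketch below shows one hypothetical set of mutually orthogonal contrasts for T = 4 equally sized research groups; the coefficients and group means are illustrative only.

    import numpy as np

    # Hypothetical orthogonal contrasts for T = 4 equally sized groups
    # (for example, a control group and three treatment arms).
    contrasts = np.array([
        [3, -1, -1, -1],   # control vs. the average of the three treatments
        [0,  2, -1, -1],   # treatment 1 vs. the average of treatments 2 and 3
        [0,  0,  1, -1],   # treatment 2 vs. treatment 3
    ])

    print(contrasts.sum(axis=1))     # each row of coefficients sums to 0
    print(contrasts @ contrasts.T)   # zero off-diagonal terms confirm orthogonality

    group_means = np.array([0.50, 0.55, 0.62, 0.58])   # hypothetical Y-bar values
    print(contrasts @ group_means)   # the (T - 1) estimated contrasts C_j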

Finally, special tests for controlling the FWER have been developed for binary or discrete data (see, for example, Westfall et al. 1999). Resampling methods or a modified Bonferroni method (Westfall and Wolfinger 1997) can be used in these instances. An alternative is the Freeman-Tukey Double Arcsine Test (Freeman and Tukey 1950).

Methods for FDR Control
Benjamini and Hochberg (1995) showed that when conducting N tests, the following four-step procedure will control the FDR at the α level:

  1. Conduct N separate t-tests, each at the common significance level α.
  2. Order the p-values of the N tests from smallest to largest, where p(1) ≤ p(2) ≤ ... ≤ p(N) are the ordered p-values.
  3. Define k as the maximum j for which p(j) ≤ (j/N)α.
  4. Reject all null hypotheses H0(j), j = 1, 2, ..., k. If no such k exists, then no hypotheses are rejected.

This "step-up" sequential procedure, which has become increasingly popular in the literature, is easy to use because it is based solely on p-values from the individual tests. Benjamini and Hochberg (1995) first proved that this procedure—which is hereafter referred to as the BH procedure—controls the FDR for continuous test statistics and Benjamini and Yekutieli (2001) proved that this procedure also controls the FDR for discrete test statistics.

The original result in Benjamini and Hochberg (1995) was proved assuming independent tests corresponding to the true null hypotheses (although independence was not required for test statistics corresponding to the false null hypotheses). Benjamini and Yekutieli (2001) proved, however, that the BH procedure also controls the FDR for true null hypotheses with "positive regression dependence." This technical condition is satisfied for some test statistics of interest, such as one-sided multivariate normal tests with nonnegative correlations between tests, but is not satisfied for other statistics. More research is needed to assess whether the BH procedure is robust when independence and positive regression dependency are violated.

What Are Problems with These Solutions?

There are two related concerns with the adjustment procedures discussed above: (1) they result in tests with reduced statistical power and (2) they could result in tests with even less power when the test statistics are correlated (dependent).

Losses in Statistical Power
The statistical procedures that control for multiplicity reduce Type I error rates for individual tests. Consequently, these adjustment procedures result in tests with reduced statistical power—the probability of rejecting the null hypothesis given that the null hypothesis is false. Stated differently, these adjustment methods reduce the likelihood that the tests will identify true differences between the contrasted groups. The more conservative the multiple testing strategy, the greater the power loss.

Table B.4 demonstrates power losses based on the simulations discussed above when the FWER is controlled using the Bonferroni and Holm procedures and the FDR is controlled using the BH procedure.

The key findings from the table are as follows:

  • Power losses can be large using the Bonferroni and Holm procedures. The power of each method decreases with the number of tests. In the absence of multiple comparison adjustments, statistical power is 80 percent. Applying the Bonferroni correction reduces the power to 59 percent for 5 tests, 41 percent for 20 tests, and 31 percent for 50 tests. The Holm and Bonferroni procedures yield tests with similar power, but the Holm procedure performs slightly better if many impacts truly exist.
  • The power of the BH procedure increases with the number of intervention effects that truly exist. Statistical power is 55 percent if 80 percent of null hypotheses are true, compared to 74 percent if only 20 percent of null hypotheses are true. The power of the BH procedure does not vary with the number of tests.
  • Power losses are smaller for the BH procedure than for the Bonferroni and Holm procedures. Differences between the procedures become larger as (1) the number of tests increases and (2) the number of true null hypotheses decreases. Thus, power losses can be considerably smaller under the BH procedure if many contrasts truly differ.

These results suggest that multiplicity adjustments involve a tradeoff between Type I and Type II error rates. Conservative testing strategies, such as the Bonferroni and similar methods, can result in considerable losses in the statistical power of the tests, even if only a small number of tests are performed. The less conservative BH test has noticeably more power if a high percentage of all null hypotheses are false.

Dependent Test Statistics
Individual test statistics are likely to be related in many evaluations of educational interventions. Consider testing for intervention effects across many outcomes measured for the same subjects. In this case, the test statistics are likely to be correlated, because a common latent factor may affect the outcomes for the same individual and treatment effects may be correlated across outcomes. As another example, if multiple treatment alternatives are compared to each other or to the same control group, the test statistics are correlated because of the overlap in the samples across pairwise contrasts.

Some of the adjustment methods discussed above (such as the Bonferroni and Holm methods) control the FWER at a given α level when tests are correlated. However, for some forms of dependency, these methods may adjust significance levels for individual tests by more than is necessary to control the FWER. This could lead to further reductions in the statistical power of the tests. For example, if test correlations are positive and large, each test statistic is providing similar information about intervention effects, and thus, would likely produce similar p-values. Consequently, in these situations, fewer adjustments to Type I error rates are needed to control the FWER.

This problem can be demonstrated using the simulations discussed above. FWER values were calculated for 10 tests, all with true null hypotheses, when the correlation between all test statistics (ρ) ranged from 0 to 1. FWER values were calculated for unadjusted t-tests and using the Bonferroni and Holm adjustment methods (Table B.5).

The unadjusted FWERs become smaller as ρ increases (Table B.5). The FWER falls from 0.40 for independent tests (ρ = 0) to 0.24 when ρ = 0.6 and to 0.18 when ρ = 0.8. As a result, the Bonferroni and Holm methods "overcorrect" for multiplicity in this example and yield test statistics with reduced power.
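This pattern can be reproduced with a short simulation. The sketch below approximates large-sample t-statistics with equicorrelated standard normal test statistics (a simplifying assumption not used in the report's own simulations, which drew samples and ran t-tests) and estimates the unadjusted FWER for 10 tests with true null hypotheses.

    import numpy as np
    from scipy import stats

    def unadjusted_fwer(rho, n_tests=10, alpha=0.05, reps=10_000, seed=0):
        # Equicorrelated test statistics via a one-factor model:
        # z_j = sqrt(rho) * u + sqrt(1 - rho) * e_j, so corr(z_j, z_k) = rho.
        rng = np.random.default_rng(seed)
        u = rng.standard_normal((reps, 1))
        e = rng.standard_normal((reps, n_tests))
        z = np.sqrt(rho) * u + np.sqrt(1 - rho) * e
        crit = stats.norm.ppf(1 - alpha / 2)             # two-sided critical value
        return float((np.abs(z) > crit).any(axis=1).mean())

    for rho in (0.0, 0.6, 0.8):
        # The estimated FWER falls as rho rises, echoing the pattern in Table B.5.
        print(rho, round(unadjusted_fwer(rho), 2))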

Several methods discussed above adjust for dependency across test statistics. For example, the resampling methods incorporate general forms of correlational structures across tests, and the Tukey-Kramer, Dunnett, and Orthogonal Contrast methods account for specific forms of dependency when various treatments are compared to each other or to a common control group. Thus, power gains can be achieved using these methods. In addition, as discussed, the BH method controls the FDR under certain forms of dependency and for certain test statistics, but not for others.

Summary and Recommendations

The testing guidelines discussed in this report minimize the extent to which multiple testing adjustment procedures are needed by focusing on significance tests for composite domain outcomes. However, adjustment procedures for individual tests are needed for some testing situations, such as between-domain analyses, subgroup analyses, and designs with multiple treatment groups. Thus, this section provides general recommendations on suitable methods, although it should be emphasized that there is not one statistical procedure that is appropriate for all settings; applicable methods will depend on the key research questions and the structure of the hypothesis tests.

To control the FWER, the bootstrap or permutation resampling methods are applicable for many testing situations because they incorporate general distributional and dependency structures across tests (Westfall and Young 1993). In educational evaluations, correlated test statistics are likely to be common. Thus, the resampling methods are recommended because they could yield tests with greater statistical power than other general-purpose methods that typically do not adjust for dependent data. The main disadvantages of the resampling methods are that they are difficult to explain and are computer intensive.

For FWER control, the Holm (1979) and Bonferroni procedures may also be suitable general-purpose methods. These methods (and especially the Bonferroni procedure) are easier to explain and apply than the resampling methods. However, although the Bonferroni and Holm methods control the FWER for dependent test statistics, they do not account for the dependency structure across tests and, thus, tend to have lower statistical power than the resampling methods. Statistical power is somewhat greater for the Holm method than the Bonferroni method if many impacts truly exist across the contrasts.

Hsu (1996) recommends alternative procedures to control the FWER in certain testing situations. The Tukey-Kramer (Tukey 1953, Kramer 1956) method is recommended if all pairwise comparisons of means are of primary interest, and the Dunnett (1955) method is recommended if multiple treatments are compared to a common control group. These methods can yield greater statistical power than other methods because they account for the exact nature of the dependency across test statistics due to the repetition of samples across contrasts. The use of orthogonal contrasts is another possibility if they correspond to key research questions.

The BH procedure (Benjamini and Hochberg 1995) controls the FDR, which is a different criterion than the FWER for defining the overall Type I error rate across the family of tests. As discussed, if many beneficial impacts truly exist, the BH procedure, by allowing more Type I errors, tends to have more statistical power than the methods that control the FWER. The philosophy of the BH procedure is that if many impacts are found to be statistically significant, a few more false positives will not change the overall validity of study conclusions about intervention effects. However, if few impacts are statistically significant (signaling that many null hypotheses are true), the BH and FWER-controlling methods are similar.

There are several issues that need to be considered when using the BH procedure. First, it may not control the FDR for all forms of dependency across test statistics. Second, the BH method may be less appropriate for some confirmatory analyses than methods that control the FWER, because it applies a less stringent standard for controlling Type I errors. On the other hand, the BH method may be appealing because it operates under the philosophy that conclusions regarding intervention effects are to be based on the preponderance of evidence and could lead to increases in the statistical power of the tests. The choice of whether the study aims to control the FDR or FWER is an important design issue that needs to be specified prior to the data analysis.


1 Kirk (1994) provides an excellent introduction to the basics of statistical inference and multiple testing. Westfall et al. (1999), Shaffer (1995), and Hsu (1996) are excellent sources for more detailed discussions of the multiple comparisons problem. Savitz and Olshan (1995) discuss the multiple comparisons issue in the context of interpreting epidemiologic data.
2 Under a one-tailed test, the alternative hypothesis is H1j: δj >0 if larger values of the outcome are desirable or H1j: δj <0 if smaller values of the outcome are desirable.
3 Mathematically, the FDR equals E(B/Q|Q>0)P(Q>0).
4 For example, suppose that there are 10 tests and that the null hypothesis is true for 9 tests. Suppose also that the true contrast for the 10th test is so large that the null hypothesis for the composite F-test would always be rejected. In this case, the Type I error rate would not be controlled for the second-stage t-tests, because there would be a 37 percent chance that these second-stage tests would reject at least one of the 9 tests with true null hypotheses.