This appendix introduces the hypothesis testing framework for this report, the multiple testing problem, statistical methods to adjust for multiplicity, and some concerns that have been raised about these solutions. The goal is to provide an intuitive, nontechnical discussion of key issues related to this complex topic to help education researchers apply the guidelines presented in the report. A comprehensive review of the extensive literature in this area is beyond the scope of this introductory discussion. The focus is on continuous outcomes, but appropriate procedures are highlighted for other types of outcomes (such as binary outcomes). The appendix concludes with recommended methods.1
The Hypothesis Testing Framework
In this report, it has been assumed that a classical (frequentist) hypothesis
testing approach is used to analyze the data; this is the testing strategy that
is typically used in IES evaluations. This section highlights key features of this
approach. Appendix D summarizes key features of the alternative
Bayesian testing approach.
To describe the classical approach, it is assumed that treatment and control groups are randomly selected from a known population and that data on multiple outcomes are collected on each sample member. For contrast j, let μTj and μCj be population means for the treatment and control groups (or two treatment groups), respectively, and let δj = μTj - μCj be the population average treatment effect (impact). In the classical framework, population means and, hence, population impacts are assumed to be fixed.
Statistical analysis under this approach usually centers on a significance test—such as a two-tailed t-test—of a null hypothesis H0j: δj = 0 versus the alternative hypothesis H1j: δj ≠ 0.2 The Type I error rate—the probability of rejecting H0j given that it is true—is typically set at α = 5 percent for each test. Evaluation sample sizes are typically determined so that statistical power—the probability of rejecting H0j given that it is false—is 80 percent if the true impact is equal to a value that is deemed to be educationally meaningful or realistically attainable by the intervention.
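To make the power target concrete, the sketch below approximates the power of a two-tailed test using the normal approximation to the t distribution. The sample size of 1,000 per group, outcome standard deviation of 1, and impact of 0.125 standard deviations are illustrative values (they match the simulation design described later in this appendix), not prescriptions:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_sample(delta, sigma, n_per_group, crit=1.96):
    """Approximate power of a two-tailed test of H0: delta = 0 at the
    5 percent level, using the normal approximation to the t test.
    delta: true impact; sigma: outcome SD; n_per_group: size of each group."""
    se = sigma * sqrt(2.0 / n_per_group)  # SE of the impact estimate
    z = delta / se                        # standardized true impact
    # P(reject) = P(Z > crit - z) + P(Z < -crit - z)
    return (1.0 - normal_cdf(crit - z)) + normal_cdf(-crit - z)

# With 1,000 members per group and sigma = 1, an impact of 0.125
# yields power of roughly 80 percent.
print(round(power_two_sample(0.125, 1.0, 1000), 2))
```

The 0.125 value is exactly the impact used in the simulations reported below, which were calibrated so that each individual test has 80 percent power.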
Under this framework, a null hypothesis is typically rejected (that is, an impact is declared statistically significant) if the p-value for the statistical test is less than 5 percent or, equivalently, if the 95 percent confidence interval for the contrast does not contain zero. This is a frequentist approach because, in hypothetical repeated random assignments of population members to the treatment and control groups, 95 percent of the constructed confidence intervals would contain the true, fixed population impact. Probabilistic statements can be made about the random confidence interval but not about the fixed impact.
What Is the Multiple Testing Problem?
Researchers typically perform many simultaneous hypothesis tests when analyzing
experimental data. Multiple tests are conducted to assess intervention effects (treatment-control
differences) across multiple outcomes (endpoints). In some evaluations, multiple
tests are also conducted to assess differences in intervention effects across multiple
treatment groups (such as those defined by various reading or math curricula) or
population subgroups (such as student subgroups defined by age, gender, race/ethnicity,
or baseline risk factors).
In such instances, separate t-tests for each contrast are often performed to test the null hypothesis of no impacts, where the Type I error rate is typically set at α = 5 percent for each test. Thus, for each test, the chance of erroneously finding a statistically significant impact is 5 percent. However, when the "family" of hypothesis tests is considered together, the "combined" Type I error rate could be considerably larger than 5 percent. This is the heart of the multiple testing problem.
For example, suppose that the null hypothesis is true for each test and that the tests are independent. Then, the chance of finding at least one spurious impact is 1 - (1 - α)^N, where N is the number of tests. Thus, the probability of making at least one Type I error is 23 percent if 5 tests are conducted, 64 percent for 20 tests, and 92 percent for 50 tests (Table B.1).
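This calculation can be reproduced directly from the formula in the text; the function name is chosen here for illustration:

```python
def combined_type1_rate(n_tests, alpha=0.05):
    """Probability of at least one Type I error among n_tests
    independent tests, each conducted at level alpha: 1 - (1 - alpha)^N."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (5, 20, 50):
    print(n, round(combined_type1_rate(n), 2))
# 5 -> 0.23, 20 -> 0.64, 50 -> 0.92, matching Table B.1
```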
The definition of the combined Type I error rate has implications for the strategies used to adjust for multiple testing and for interpreting the impact findings. The next section discusses the most common definitions found in the literature and provides numerical examples.
Definitions of the Combined Type I Error Rate
The two most common definitions of the combined Type I error rate found in the literature
are the (1) familywise error rate (FWER) and (2) false discovery rate
(FDR):
Table B.2 helps clarify these two error rates. Suppose that multiple tests are conducted to assess intervention effects on N study outcomes and that M null hypotheses are true (M is unobservable). Suppose further that based on t-tests, Q null hypotheses are rejected and that A, B, C, and D signify cell counts when t-test results are compared to the truth. The counts Q and A to D are random variables.
In Table B.2, the FWER is the probability that the random variable B is at least 1; that is, it is the probability that at least one of the M true null hypotheses is rejected. The FDR equals the expected value of B/Q, where B/Q is defined to equal 0 if Q = 0.3 If all null hypotheses are true, then B = Q and the FDR and FWER are equivalent; otherwise, the FDR is smaller than or equal to the FWER.
The two error rates have a different philosophical basis. The FWER measures the likelihood of a single erroneous rejection of the null hypothesis across the family of tests. Researchers who focus on the FWER are concerned with mistakenly reporting any statistically significant findings. The concerns are that unwarranted scientific conclusions about evaluation findings could be made as a result of even one mistake and that researchers may select erroneous significant findings for emphasis when reporting and publishing results.
The rationale behind the FDR is that a few erroneous rejections may not be as problematic for drawing conclusions about the family tested when many null hypotheses are rejected as they would be if only a few null hypotheses are rejected. The rejection of many null hypotheses is a signal that there are real differences across the contrasted groups. Thus, researchers might be willing to tolerate more false positives (that is, larger values for B) if realized values for Q were large than if they were small. Under this approach, conclusions regarding intervention effects are to be based on the preponderance of evidence; the set of discoveries is to be used to reach an overall decision about the treatment. For those who adopt this approach, controlling the FWER is too conservative because, if many significant effects are found, a few additional errors will not change the overall validity of study findings.
Finally, variants of the FWER and FDR have been proposed in the literature. For example, Gordon et al. (2007) discuss a variant of the FWER—the per family error rate (PFER)—which is the expected number of Type I errors that are made (that is, the expected value of B in Table B.2). For instance, the PFER equals 5 if, across all tests with true null hypotheses, 5 are expected to be statistically significant. Gordon et al. (2007) argue that the use of the PFER (which focuses on expectations) rather than the FWER (which focuses on probabilities) could yield tests with more statistical power while maintaining stringent standards for Type I error rates across the family of tests. Storey (2002) introduced a variant of the FDR—the positive false discovery rate (pFDR)—which is the expected value of B/Q given that Q is positive. He argues that this measure may be more appropriate in some instances.
Quantifying the FWER and FDR
To demonstrate the relationship between the FDR and the FWER, simulated data were
generated on N mean outcomes for samples of 1,000 treatment and 1,000 control
group members (Table B.3). Data for mean outcome j were obtained as random
draws from a normal distribution with standard deviation 1 and with mean μCj
for controls and μTj for treatments. The impacts (μTj - μCj) were set to 0 for M outcomes with true null hypotheses and to 0.125 for (N - M) outcomes with false null hypotheses;
the 0.125 value was set so that the statistical power of the tests was 80 percent.
Each individual hypothesis was tested using a two-tailed t-test at the
5 percent significance level and the test statistics were generated independently.
Each simulation involved 10,000 repetitions. Simulations were conducted for N
= 5, 10, 20, and 50 and for M/N = 100, 80, 50, and 20 percent.
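A minimal sketch of this simulation design follows; the normal approximation to the two-tailed t-test and the fixed random seed are simplifications introduced here for illustration:

```python
import random

def simulate_error_rates(n_tests, n_true_null, reps=10000, seed=12345):
    """Monte Carlo estimates of the FWER and FDR, mimicking the design
    of Table B.3: 1,000 treatments and 1,000 controls per outcome,
    outcome SD 1, impact 0.125 for false null hypotheses, and independent
    two-tailed tests at the 5 percent level (normal approximation)."""
    rng = random.Random(seed)
    se = (2.0 / 1000.0) ** 0.5  # SE of each impact estimate
    deltas = [0.0] * n_true_null + [0.125] * (n_tests - n_true_null)
    any_false_rej = 0
    fdr_sum = 0.0
    for _ in range(reps):
        false_rej = total_rej = 0
        for d in deltas:
            t = rng.gauss(d, se) / se  # test statistic
            if abs(t) > 1.96:
                total_rej += 1
                if d == 0.0:
                    false_rej += 1
        if false_rej > 0:
            any_false_rej += 1
        if total_rej > 0:           # B/Q defined as 0 when Q = 0
            fdr_sum += false_rej / total_rej
    return any_false_rej / reps, fdr_sum / reps  # (FWER, FDR)

fwer, fdr = simulate_error_rates(n_tests=10, n_true_null=5)
print(round(fwer, 2), round(fdr, 2))
```

With 10 tests and half the null hypotheses true, the estimated FWER should be close to the theoretical 1 - 0.95^5 ≈ 0.23, while the estimated FDR is noticeably smaller, previewing the pattern in Table B.3.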
The main results in Table B.3 are as follows:
These results suggest that the FDR is a less conservative measure than the FWER, especially if a considerable fraction of all null hypotheses are false. Thus, as demonstrated below, methods that control the FDR could yield tests with greater statistical power than those that control the FWER. The choice of which error criterion to control is important and must be made prior to the data analysis.
What Are Statistical Solutions to the Multiple Testing Problem?
A large body of literature describes statistical methods to adjust Type I errors
for multiple testing (see, for example, the books by
Westfall et al. 1999, Hsu 1996, and
Westfall and Young 1993). The literature suggests that there is not
one method that is preferred in all instances. Rather, the appropriate measure will
depend on the study design, the primary research questions that are to be addressed,
and the strength of inferences that are required.
This section briefly summarizes the literature in this area. Methods that control the FWER are discussed first, and methods that control the FDR are discussed second. Statistical packages (such as SAS) can be used to apply many of these methods.
Methods for FWER Control
Until recently, most of the literature on multiple testing focused on methods to
control the FWER at a given α level (that is, methods to ensure that
the FWER ≤ α). The most well-known method is the Bonferroni
procedure, which sets the significance level for individual tests at α/N,
where N is the number of tests.
The Bonferroni procedure controls the FWER when all null hypotheses are true or when some are true and some are false (that is, it provides "strong" control of the FWER). This feature differs from another well-known procedure—Fisher's protected least significant difference (LSD)—where an overall F-test across the tests is first conducted at the α level and further comparisons about individual contrasts are conducted at the α level only if the F-test is significant (Fisher 1935). Fisher's LSD controls the FWER only when all null hypotheses are true and, thus, provides "weak" control of the FWER. This means that Fisher's LSD may not control the FWER for second-stage individual hypotheses.4 The same issue applies to other multiple-stage tests, such as the Newman-Keuls (Newman 1939, Keuls 1952) and Duncan (1955) methods.
The Bonferroni method applies to both continuous and discrete data, controls the FWER when the tests are correlated, and provides adjusted confidence bounds (by using α/N rather than α in the calculations). Furthermore, it is flexible because it controls the FWER for tests of joint hypotheses about any subset of N separate hypotheses (including individual contrasts). The procedure will reject a joint hypothesis H0 if any p-value for the individual hypotheses included in H0 is less than α/N. The Bonferroni method, however, yields conservative bounds on Type I error and, hence, has low power.
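The Bonferroni rule itself takes only a few lines; the p-values in the usage line are invented for illustration:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni procedure: test each of the N hypotheses at level
    alpha/N, which guarantees FWER <= alpha under any dependency
    structure among the tests."""
    n = len(p_values)
    return [p < alpha / n for p in p_values]

# Five tests at an overall 5 percent level: each p-value is compared to 0.01.
print(bonferroni_reject([0.003, 0.012, 0.04, 0.20, 0.65]))
# -> [True, False, False, False, False]
```

Note that the second test, with p = 0.012, would be significant at the unadjusted 5 percent level but is not significant after the Bonferroni adjustment.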
Many modified and sometimes more powerful versions of the Bonferroni method have been developed that provide strong control of the FWER. We provide several examples:
Bootstrap and permutation resampling methods are alternative, computer-intensive methods that provide strong control of the FWER (see, for example, Westfall and Young 1993 and Westfall et al. 1990). These methods incorporate distributional and correlational structures across tests, so they tend to be less conservative than the other general-purpose methods and, hence, may have more power. Furthermore, they are applicable in many testing situations. These methods can be applied as follows:
The intuition behind this procedure is that the distribution of the maximum t-statistic (minimum p-value) provides simultaneous confidence intervals that apply to all tests under the null hypothesis. The resampling methods use the data to estimate this distribution, which yields the multiplicity-adjusted p-values. In essence, a hypothesis test is rejected if the actual t-statistic value for that test is in the tail of the maximum t-statistic distribution.
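The following is a rough sketch of a single-step max-t permutation adjustment in the spirit of Westfall and Young (1993). The data layout (lists of per-subject outcome vectors), the number of permutations, the fixed seeds, and the simulated example data are all illustrative assumptions, not details taken from the report:

```python
import random

def t_stat(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (mx - my) / (sp2 * (1.0 / nx + 1.0 / ny)) ** 0.5

def maxt_adjusted_pvalues(treat, control, n_perm=2000, seed=7):
    """Single-step max-t permutation adjustment: the adjusted p-value
    for outcome j is the share of label permutations whose maximum |t|
    across all outcomes is at least the observed |t_j|."""
    rng = random.Random(seed)
    n_out = len(treat[0])
    obs = [abs(t_stat([s[j] for s in treat], [s[j] for s in control]))
           for j in range(n_out)]
    pooled = list(treat) + list(control)
    n_t = len(treat)
    exceed = [0] * n_out
    for _ in range(n_perm):
        rng.shuffle(pooled)                    # permute group labels
        pt, pc = pooled[:n_t], pooled[n_t:]
        max_t = max(abs(t_stat([s[j] for s in pt], [s[j] for s in pc]))
                    for j in range(n_out))
        for j in range(n_out):
            if max_t >= obs[j]:
                exceed[j] += 1
    return [e / n_perm for e in exceed]

# Illustrative data: a large effect on the first outcome, none on the second.
rng = random.Random(1)
treat = [[rng.gauss(1.5, 1), rng.gauss(0, 1)] for _ in range(30)]
control = [[rng.gauss(0.0, 1), rng.gauss(0, 1)] for _ in range(30)]
adj_p = maxt_adjusted_pvalues(treat, control)
print(adj_p)
```

Because the adjustment is based on the permutation distribution of the maximum t-statistic, it automatically reflects the correlation between the two outcomes rather than assuming independence.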
Alternative methods to control the FWER have been developed when the design contains several treatment and control groups. The Tukey-Kramer (Tukey 1953, Kramer 1956) method is applicable if all pairwise comparisons of treatment and control means are of primary interest. If comparisons with a single control group are of primary interest, the Dunnett (1955) method is appropriate. These methods account for the dependence across test statistics due to the repetition of samples across contrasts.
The use of planned orthogonal contrasts is another method that adjusts for dependency when T treatment and control groups are compared to each other (see, for example, Bechhofer and Dunnett 1982). To describe this procedure, let Yi be a mean outcome (composite) for research group i and let Cj = Σi cjiYi, where the cji are constants such that Σi cji = 0 (j = 1,...,(T - 1)). The Cjs represent a family of (T - 1) contrasts (linear combinations) of the Yis. Mutually orthogonal contrasts arise if sample sizes are the same in each treatment condition and Σi cjicki = 0 for all j ≠ k. A property of orthogonal contrasts is that the total sum of squares across the T research groups can be partitioned into (T - 1) sums of squares for each orthogonal contrast.
Significance testing can be performed for each orthogonal contrast using the multiple comparisons adjustment procedures discussed above. An advantage of this method is that testing problems associated with dependent test statistics disappear. Furthermore, if T is large, the use of orthogonal contrasts requires fewer test statistics than the Tukey-Kramer procedure, thereby reducing the multiple comparisons problem.
The use of planned orthogonal contrasts may be desirable if they correspond to key research questions and can be easily interpreted. However, this approach may not be appropriate if the key contrasts of interest are not mutually orthogonal.
Finally, special tests for controlling the FWER have been developed for binary or discrete data (see, for example, Westfall et al. 1999). Resampling methods or a modified Bonferroni method (Westfall and Wolfinger 1997) can be used in these instances. An alternative is the Freeman-Tukey Double Arcsine Test (Freeman and Tukey 1950).
Methods for FDR Control
Benjamini and Hochberg (1995) showed that when conducting N tests, the following four-step procedure will control the FDR at the α level: (1) order the p-values of the N tests from smallest to largest, p(1) ≤ p(2) ≤ ... ≤ p(N); (2) find k, the largest value of j for which p(j) ≤ (j/N)α; (3) if such a k exists, reject the null hypotheses corresponding to p(1),...,p(k); and (4) otherwise, reject no null hypotheses.
This "step-up" sequential procedure, which has become increasingly popular in the literature, is easy to use because it is based solely on p-values from the individual tests. Benjamini and Hochberg (1995) first proved that this procedure—which is hereafter referred to as the BH procedure—controls the FDR for continuous test statistics and Benjamini and Yekutieli (2001) proved that this procedure also controls the FDR for discrete test statistics.
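The step-up rule can be implemented in a few lines; the p-values in the example are invented for illustration:

```python
def bh_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at
    alpha. Returns a list of booleans aligned with the input p-values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0  # largest j (1-based) with p_(j) <= (j/N)*alpha
    for j, i in enumerate(order, start=1):
        if p_values[i] <= j * alpha / n:
            k = j
    reject = [False] * n
    for i in order[:k]:
        reject[i] = True
    return reject

print(bh_reject([0.001, 0.012, 0.014, 0.20, 0.65]))
# -> [True, True, True, False, False]
```

In this example the BH procedure rejects three hypotheses, whereas the Bonferroni cutoff of α/N = 0.01 would reject only the first, illustrating the power gain when many null hypotheses are false.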
The original result in Benjamini and Hochberg (1995) was proved assuming independent tests corresponding to the true null hypotheses (although independence was not required for test statistics corresponding to the false null hypotheses). Benjamini and Yekutieli (2001) proved, however, that the BH procedure also controls the FDR for true null hypotheses with "positive regression dependence." This technical condition is satisfied for some test statistics of interest, such as one-sided multivariate normal tests with nonnegative correlations between tests, but is not satisfied for other statistics. More research is needed to assess whether the BH procedure is robust when independence and positive regression dependency are violated.
What Are Problems with These Solutions?
There are two related concerns with the adjustment procedures discussed above: (1)
they result in tests with reduced statistical power and (2) they could result in
tests with even less power when the test statistics are correlated (dependent).
Losses in Statistical Power
The statistical procedures that control for multiplicity reduce Type I error rates
for individual tests. Consequently, these adjustment procedures result in tests
with reduced statistical power—the probability of rejecting the null hypothesis
given that the null hypothesis is false. Stated differently, these adjustment methods
reduce the likelihood that the tests will identify true differences between
the contrasted groups. The more conservative the multiple testing strategy, the
greater the power loss.
Table B.4 demonstrates power losses based on the simulations discussed above when the FWER is controlled using the Bonferroni and Holm procedures and the FDR is controlled using the BH procedure.
The key findings from the table are as follows:
These results suggest that multiplicity adjustments involve a tradeoff between Type I and Type II error rates. Conservative testing strategies, such as the Bonferroni and similar methods, can result in considerable losses in the statistical power of the tests, even if only a small number of tests are performed. The less conservative BH test has noticeably more power if a high percentage of all null hypotheses are false.
Dependent Test Statistics
Individual test statistics are likely to be related in many evaluations of educational
interventions. Consider testing for intervention effects across many outcomes measured
for the same subjects. In this case, the test statistics are likely to be correlated,
because a common latent factor may affect the outcomes for the same individual and
treatment effects may be correlated across outcomes. As another example, if multiple
treatment alternatives are compared to each other or to the same control group,
the test statistics are correlated because of the overlap in the samples across
pairwise contrasts.
Some of the adjustment methods discussed above (such as the Bonferroni and Holm methods) control the FWER at a given α level when tests are correlated. However, for some forms of dependency, these methods may adjust significance levels for individual tests by more than is necessary to control the FWER. This could lead to further reductions in the statistical power of the tests. For example, if test correlations are positive and large, each test statistic is providing similar information about intervention effects, and thus, would likely produce similar p-values. Consequently, in these situations, fewer adjustments to Type I error rates are needed to control the FWER.
This problem can be demonstrated using the simulations discussed above. FWER values were calculated for 10 tests, all with true null hypotheses, when the correlation between all test statistics (ρ) ranged from 0 to 1. FWER values were calculated for unadjusted t-tests and using the Bonferroni and Holm adjustment methods (Table B.5).
The unadjusted FWERs become smaller as ρ increases (Table B.5), falling from 0.40 for independent tests (ρ = 0) to 0.24 when ρ = 0.6 and to 0.18 when ρ = 0.8. As a result, the Bonferroni and Holm methods "overcorrect" for multiplicity in this example and yield test statistics with reduced power.
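This pattern can be reproduced with a small Monte Carlo sketch. The common-factor construction of equicorrelated test statistics, the number of repetitions, and the fixed seed are assumptions of this illustration:

```python
import random

def fwer_correlated(n_tests=10, rho=0.0, reps=10000, seed=99):
    """Monte Carlo FWER for n_tests true null hypotheses when the test
    statistics are equicorrelated with correlation rho, generated via a
    common factor: t_i = sqrt(rho)*z0 + sqrt(1-rho)*z_i. Each test is an
    unadjusted two-tailed test at the 5 percent level."""
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    hits = 0
    for _ in range(reps):
        z0 = rng.gauss(0, 1)  # common factor shared by all tests
        if any(abs(a * z0 + b * rng.gauss(0, 1)) > 1.96
               for _ in range(n_tests)):
            hits += 1
    return hits / reps

for rho in (0.0, 0.6, 0.8):
    print(rho, round(fwer_correlated(rho=rho), 2))
```

The estimates should land near the Table B.5 values of 0.40, 0.24, and 0.18, showing how positive correlation shrinks the unadjusted FWER and why fixed Bonferroni-type corrections then become too severe.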
Several methods discussed above adjust for dependency across test statistics. For example, the resampling methods incorporate general forms of correlational structures across tests, and the Tukey-Kramer, Dunnett, and Orthogonal Contrast methods account for specific forms of dependency when various treatments are compared to each other or to a common control group. Thus, power gains can be achieved using these methods. In addition, as discussed, the BH method controls the FDR under certain forms of dependency and for certain test statistics, but not for others.
Summary and Recommendations
The testing guidelines discussed in this report minimize the extent to which multiple
testing adjustment procedures are needed by focusing on significance tests for composite
domain outcomes. However, adjustment procedures for individual tests are needed
for some testing situations, such as between-domain analyses, subgroup analyses,
and designs with multiple treatment groups. Thus, this section provides general
recommendations on suitable methods, although it should be emphasized there is not
one statistical procedure that is appropriate for all settings; applicable methods
will depend on the key research questions and the structure of the hypothesis tests.
To control the FWER, the bootstrap or permutation resampling methods are applicable for many testing situations because they incorporate general distributional and dependency structures across tests (Westfall and Young 1993). In educational evaluations, correlated test statistics are likely to be common. Thus, the resampling methods are recommended because they could yield tests with greater statistical power than other general-purpose methods that typically do not adjust for dependent data. The main disadvantages of the resampling methods are that they are difficult to explain and are computer intensive.
For FWER control, the Holm (1979) and Bonferroni procedures may also be suitable general-purpose methods. These methods (and especially the Bonferroni procedure) are easier to explain and apply than the resampling methods. However, although the Bonferroni and Holm methods control the FWER for dependent test statistics, they do not account for the dependency structure across tests and, thus, tend to have lower statistical power than the resampling methods. Statistical power is somewhat greater for the Holm method than the Bonferroni method if many impacts truly exist across the contrasts.
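A sketch of the Holm step-down rule, with invented p-values, shows why it is at least as powerful as the Bonferroni procedure:

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: compare the smallest p-value to
    alpha/N, the next smallest to alpha/(N-1), and so on, stopping at
    the first nonrejection. Controls the FWER at alpha and is uniformly
    more powerful than the Bonferroni procedure."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (n - step):
            reject[i] = True
        else:
            break  # all larger p-values are also not rejected
    return reject

# Holm rejects both small p-values below; Bonferroni (alpha/N = 0.0125)
# would reject only the first.
print(holm_reject([0.004, 0.013, 0.10, 0.30]))
# -> [True, True, False, False]
```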
Hsu (1996) recommends alternative procedures to control the FWER in certain testing situations. The Tukey-Kramer (Tukey 1953, Kramer 1956) method is recommended if all pairwise comparisons of means are of primary interest, and the Dunnett (1955) method is recommended if multiple treatments are compared to a common control group. These methods can yield greater statistical power than other methods because they account for the exact nature of the dependency across test statistics due to the repetition of samples across contrasts. The use of orthogonal contrasts is another possibility if they correspond to key research questions.
The BH procedure (Benjamini and Hochberg 1995) controls the FDR, which is a different criterion than the FWER for defining the overall Type I error rate across the family of tests. As discussed, if many beneficial impacts truly exist, the BH procedure tends to have more statistical power than the methods that control the FWER by allowing more Type I errors. The philosophy of the BH procedure is that if many impacts are found to be statistically significant, a few more false positives will not change the overall validity of study conclusions about intervention effects. However, if few impacts are statistically significant (signaling that many null hypotheses are true), the BH and FWER-controlling methods are similar.
There are several issues that need to be considered when using the BH procedure. First, it may not control the FDR for all forms of dependency across test statistics. Second, the BH method may be less appropriate for some confirmatory analyses than methods that control the FWER, because it applies a less stringent standard for controlling Type I errors. On the other hand, the BH method may be appealing because it operates under the philosophy that conclusions regarding intervention effects are to be based on the preponderance of evidence and could lead to increases in the statistical power of the tests. The choice of whether the study aims to control the FDR or FWER is an important design issue that needs to be specified prior to the data analysis.