Technical Methods Report: Guidelines for Multiple Testing in Impact Evaluations

NCEE 2008-4018
May 2008

Appendix D: The Bayesian Hypothesis Testing Framework

This appendix summarizes key features of the Bayesian testing approach, which is the main alternative to the classical testing approach. The use of these methods in IES studies is an important area for future research. Spiegelhalter et al. (1994) and Gelman and Tuerlinckx (2000) provide a more detailed discussion of the Bayesian framework.

In the Bayesian view, assessing the effects of an intervention is a dynamic process in which any individual study takes place in a context of continuously increasing knowledge. Initial beliefs about treatment effects are incorporated into the analysis and are expressed as a prior distribution. The prior distribution could be based on objective evidence or subjective judgment, and the shape and location of this distribution reflect the level of confidence in the prior information.

Using Bayes' theorem, the prior distribution of the impact, $f(\delta_j)$, is combined with the conditional distribution of the observed data given the impact, $g(\text{data} \mid \delta_j)$, to obtain a posterior (updated) distribution of the treatment effect:

$$h(\delta_j \mid \text{data}) \propto g(\text{data} \mid \delta_j)\, f(\delta_j)$$

The Bayesian impact estimate is the mean of the posterior distribution. If both the prior and conditional distributions are normal, the mean of the posterior distribution is a weighted average of the observed impact and the mean of the prior distribution, where each weight is inversely proportional to the variance of the corresponding (likelihood or prior) distribution. Thus, the Bayesian approach “shrinks” the observed impact estimate toward the mean of the prior distribution. The Bayesian approach addresses the following question: “What is the updated evidence on the impact, once we combine the previous evidence with the new evidence?”
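The weighted-average form of the posterior mean can be illustrated with a minimal sketch. It assumes a normal likelihood and a normal prior, as in the paragraph above; the function name `normal_posterior` and the numeric inputs are hypothetical and chosen only for illustration.

```python
def normal_posterior(observed_impact, se, prior_mean, prior_sd):
    """Combine a normal likelihood with a normal prior (conjugate update).

    observed_impact, se : impact estimate from the study and its standard error
    prior_mean, prior_sd: mean and standard deviation of the prior on the impact
    """
    data_precision = 1.0 / se**2          # precision = 1 / variance
    prior_precision = 1.0 / prior_sd**2
    # Weight on the observed impact; the prior mean receives 1 - weight.
    weight_on_data = data_precision / (data_precision + prior_precision)
    post_mean = weight_on_data * observed_impact + (1.0 - weight_on_data) * prior_mean
    post_sd = (data_precision + prior_precision) ** -0.5
    return post_mean, post_sd, weight_on_data


# Hypothetical inputs: observed impact 0.20 (SE 0.10), prior centered at 0 (SD 0.10).
post_mean, post_sd, weight = normal_posterior(0.20, 0.10, 0.0, 0.10)
print(post_mean, post_sd, weight)
```

Because the likelihood and prior variances are equal in this illustration, the two sources of evidence are weighted equally and the posterior mean lies halfway between the observed impact and the prior mean. A tighter prior (smaller `prior_sd`) would pull the estimate further toward the prior mean.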

Differences between the Bayesian and classical analyses include the incorporation of prior beliefs, the absence of p-values, and the absence of the idea of hypothetical repetitions of the sampling process. The posterior estimate of the impact and its uncertainty, as measured by a credibility interval, are analogous to the classical differences-in-means point estimate and its associated confidence interval. This credibility interval, however, has a direct interpretation in terms of belief; probabilistic statements can be made about the size of the impact. Those who misinterpret classical confidence intervals as the region in which the effect is likely to lie are, in essence, adopting a Bayesian point of view.

Gelman et al. (2007) argue that multiple comparisons issues typically are less of a concern in Bayesian modeling than in classical inference. This is because under the Bayesian approach, the impact estimates for the various contrasts and their credibility intervals are shifted toward each other. This leads to wider confidence bands under the Bayesian approach. For example, under the classical approach, the usual 95 percent confidence interval for the impact, $\delta_j = \mu_{Tj} - \mu_{Cj}$, is $\left[(\bar{y}_{Tj} - \bar{y}_{Cj}) \pm 1.96\sqrt{2\sigma^2/n}\right]$, where $\bar{y}_{Tj}$ and $\bar{y}_{Cj}$ are sample means of the outcome measure for treatments and controls, respectively; $\sigma^2$ is the variance of the outcome measure; and $n$ is the treatment (control) group sample size. If the likelihood and prior distributions are normally distributed, the 95 percent Bayesian credibility interval based on the posterior distribution is $\left[(\bar{y}_{Tj} - \bar{y}_{Cj}) \pm 1.96\sqrt{(2\sigma^2/n)\left(1 + (\sigma^2/n)/\tau^2\right)}\right]$, where $\tau^2$ is the variance of the prior distribution for $\delta_j$. Thus, statistical significance is less likely to be found under the Bayesian approach than under the classical approach. Consequently, the Bayesian approach is conservative and appropriately accounts for multiple comparisons in many instances.
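The comparison of interval widths can be checked numerically. The sketch below assumes the Bayesian half-width is read as 1.96 times the square root of (2σ²/n)(1 + (σ²/n)/τ²), with the square root covering both factors; the function names and input values are hypothetical.

```python
import math


def classical_half_width(sigma2, n, z=1.96):
    """Half-width of the classical 95 percent interval: z * sqrt(2 * sigma^2 / n)."""
    return z * math.sqrt(2.0 * sigma2 / n)


def bayesian_half_width(sigma2, n, tau2, z=1.96):
    """Half-width of the Bayesian interval as read from the text (an assumption):
    z * sqrt((2 * sigma^2 / n) * (1 + (sigma^2 / n) / tau^2))."""
    s2 = sigma2 / n  # sampling variance of one group mean
    return z * math.sqrt(2.0 * s2 * (1.0 + s2 / tau2))


# Hypothetical inputs: outcome variance 1, 100 units per group, prior variance 0.01.
classical = classical_half_width(sigma2=1.0, n=100)
bayesian = bayesian_half_width(sigma2=1.0, n=100, tau2=0.01)
print(classical, bayesian, bayesian / classical)
```

In this illustration the sampling variance σ²/n equals the prior variance τ², so the Bayesian half-width exceeds the classical one by a factor of √2. As τ² grows large (a diffuse prior), the inflation factor approaches 1 and the two intervals coincide.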

More research is needed about the applicability of the Bayesian approach in IES-funded experimental studies. In particular, a critical issue is whether credible prior distributions on intervention effects can be specified. This will depend on the credibility of the empirical evidence on the effects of similar interventions to the ones being tested. It will also depend on the extent to which theory can be used to structure the multidimensional data so that empirical Bayes methods can be used to formulate prior distributions from the data. For example, prior distributions for a specific site (or outcome) could be estimated using the combined impact estimates for similar sites (or outcomes) if there is a theoretical justification for these groupings.
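One way such a data-based prior might be formed is sketched below. This is an illustrative method-of-moments empirical Bayes calculation, not a procedure given in this report: it assumes the sites are exchangeable, that every site estimate has the same known standard error, and that impacts are normally distributed across sites. The function name and the site estimates are hypothetical.

```python
import statistics


def empirical_bayes_shrinkage(estimates, se):
    """Estimate a normal prior from site impact estimates, then shrink each site.

    estimates : impact estimates for sites judged similar on theoretical grounds
    se        : common standard error of each site estimate (assumed known)
    """
    prior_mean = statistics.fmean(estimates)
    # Observed variance across sites = true between-site variance + sampling variance,
    # so subtract se^2 and truncate at zero.
    tau2 = max(0.0, statistics.variance(estimates) - se**2)
    # Weight on each site's own estimate; equals 0 when tau2 is 0 (full pooling).
    weight = tau2 / (tau2 + se**2)
    shrunk = [prior_mean + weight * (e - prior_mean) for e in estimates]
    return prior_mean, tau2, shrunk


# Hypothetical site-level impact estimates, each with standard error 0.10.
site_estimates = [0.0, 0.1, 0.2, 0.3, 0.4]
prior_mean, tau2, shrunk = empirical_bayes_shrinkage(site_estimates, se=0.10)
print(prior_mean, tau2, shrunk)
```

Each site's estimate is pulled toward the combined mean, with more shrinkage when the estimated between-site variance is small relative to the sampling variance. Whether pooling across a given set of sites is defensible is the theoretical question raised in the paragraph above.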
