Technical Methods Report: Guidelines for Multiple Testing in Impact Evaluations

NCEE 2008-4018
May 2008

References

Altman, D.G., K.F. Schulz, and D. Moher (2001). "The Revised CONSORT Statement for Reporting Randomized Trials: Explanation and Elaboration." Annals of Internal Medicine, 134, 663-694.

Bechhofer, R. and C. Dunnett (1982). "Multiple Comparisons for Orthogonal Contrasts: Examples and Tables." Technometrics, 24(3), 213-222.

Benjamini, Y. and Y. Hochberg (1995). "Controlling the False Discovery Rate: A New and Powerful Approach to Multiple Testing." Journal of the Royal Statistical Society, Series B, 57, 289-300.

Benjamini, Y. and D. Yekutieli (2001). "The Control of the False Discovery Rate in Multiple Testing Under Dependency." The Annals of Statistics, 29(4), 1165-1188.

Bobko, P., P. Roth, and M. Buster (2007). "The Usefulness of Unit Weights in Creating Composite Scores." Organizational Research Methods, 10(4), 689-709.

Brookes, S.T., E. Whitley, T.J. Peters, P.A. Mulheran, M. Egger, and G. Smith (2001). "Subgroup Analyses in Randomized Controlled Trials: Quantifying the Risks of False-Positives and False-Negatives." Health Technology Assessment, 5(33), 1-49.

Committee for Proprietary Medicinal Products (CPMP) (2002). "Points to Consider on Multiplicity Issues in Clinical Trials." London: The European Agency for the Evaluation of Medicinal Products (EMEA).

Cook, R. and V. Farewell (1996). "Multiplicity Considerations in the Design and Analysis of Clinical Trials." Journal of the Royal Statistical Society, Series A, 159, 93-110.

Curran-Everett, D. (2000). "Multiple Comparisons: Philosophies and Illustrations." American Journal of Physiology, 279, R1-R8.

Duncan, D.B. (1955). "Multiple Range and Multiple F-Tests." Biometrics, 11, 1-42.

Dunnett, C.W. (1955). "A Multiple Comparison Procedure for Comparing Several Treatments with a Control." Journal of the American Statistical Association, 50, 1096-1121.

Fisher, R.A. (1935). The Design of Experiments. Edinburgh and London: Oliver and Boyd.

Freeman, M.F. and J.W. Tukey (1950). "Transformations Related to the Angular and Square Root." Annals of Mathematical Statistics, 21, 607-611.

Gelman, A., J. Hill, and M. Yajima (2007). "Why We (Usually) Don't Worry About Multiple Comparisons." Columbia University Working Paper. New York: Columbia University.

Gelman, A. and H. Stern (2006). "The Difference Between 'Significant' and 'Not Significant' Is Not Itself Statistically Significant." The American Statistician, 60(4), 328-331.

Gelman, A. and F. Tuerlinckx (2000). "Type S Error Rates for Classical and Bayesian Single and Multiple Comparison Procedures." Columbia University Working Paper. New York: Columbia University.

Gordon, A., G. Glazko, Z. Qiu, and A. Yakovlev (2007). "Control of the Mean Number of False Discoveries, Bonferroni and Stability of Multiple Testing." The Annals of Applied Statistics, 1(1), 179-190.

Gulliksen, H. (1950). Theory of Mental Tests. New York: Wiley.

Harris, R.J. (1975). A Primer of Multivariate Statistics. New York: Academic Press, Inc.

Hochberg, Y. (1988). "A Sharper Bonferroni Procedure for Multiple Tests of Significance." Biometrika, 75, 800-802.

Holm, S. (1979). "A Simple Sequentially Rejective Multiple Test Procedure." Scandinavian Journal of Statistics, 6, 65-70.

Hsu, J.C. (1996). Multiple Comparisons: Theory and Methods. London: Chapman and Hall.

Kane, M. and S. Case (2004). "The Reliability and Validity of Weighted Composite Scores." Applied Measurement in Education, 17(3), 221-240.

Keuls, M. (1952). "The Use of the 'Studentized Range' in Connection with an Analysis of Variance." Euphytica, 1, 112-122.

Kirk, R.E. (1994). Experimental Design: Procedures for the Behavioral Sciences. Pacific Grove, CA: Brooks/Cole.

Kramer, C.Y. (1956). "Extension of the Multiple Range Test to Group Means with Unequal Numbers of Replications." Biometrics, 12, 307-310.

Landis, R., D. Beal, and P. Tesluk (2000). "A Comparison of Approaches to Forming Composite Measures in Structural Equation Models." Organizational Research Methods, 3(2), 186-207.

Lang, T. and M. Secic (2007). How to Report Statistics in Medicine, 2nd ed. Philadelphia: American College of Physicians.

Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. New Jersey: Lawrence Erlbaum Associates, Inc.

Newman, D. (1939). "The Distribution of the Range in Samples From a Normal Population, Expressed in Terms of an Independent Estimate of Standard Deviation." Biometrika, 31, 20-30.

Raju, N., R. Bilgic, J. Edwards, and P. Fleer (1997). "Methodology Review: Estimation of Population and Cross Validity and the Use of Equal Weights in Prediction." Applied Psychological Measurement, 21, 291-305.

Rom, D.M. (1990). "A Sequentially Rejective Test Procedure Based on a Modified Bonferroni Inequality." Biometrika, 77, 663-665.

Rothwell, P.M. (2005). "Subgroup Analyses in Randomized Controlled Trials: Importance, Indications, and Interpretation." The Lancet, 365, 176-186.

Saville, D.J. (1990). "Multiple Comparison Procedures: The Practical Solution." The American Statistician, 44, 174-180.

Savitz, D. and F. Olshan (1995). "Multiple Comparisons and Related Issues in the Interpretation of Epidemiologic Data." American Journal of Epidemiology, 142(9), 904-908.

Scheffé, H. (1959). The Analysis of Variance. New York: John Wiley & Sons.

Shaffer, J. (1995). "Multiple Hypothesis Testing." Annual Review of Psychology, 46, 561-584.

Šidák, Z. (1967). "Rectangular Confidence Regions for the Means of Multivariate Normal Distributions." Journal of the American Statistical Association, 62, 626-633.

Spiegelhalter, D., L.S. Freedman, and M. Parmar (1994). "Bayesian Approaches to Randomized Trials." Journal of the Royal Statistical Society, Series A, 157, 357-416.

Storey, J.D. (2002). "A Direct Approach to False Discovery Rates." Journal of the Royal Statistical Society, Series B, 64, 479-498.

Tukey, J.W. (1953). "The Problem of Multiple Comparisons." In Mimeographed Notes. Princeton, NJ: Princeton University.

Wainer, H. and D. Thissen (1993). "Combining Multiple-Choice and Constructed-Response Test Scores: Toward a Marxist Theory of Test Construction." Applied Measurement in Education, 6(2), 103-118.

Wang, M. and J. Stanley (1970). "Differential Weighting: A Review of Methods and Empirical Studies." Review of Educational Research, 40, 663-705.

Westfall, P.H., Y. Lin, and S. Young (1990). "Resampling-Based Multiple Testing." In Proceedings of the Fifteenth Annual SAS Users Group International. Cary, NC: SAS Institute, Inc., 1359-1364.

Westfall, P.H., R. Tobias, D. Rom, R. Wolfinger, and Y. Hochberg (1999). Multiple Comparisons and Multiple Tests Using SAS. Cary, NC: SAS Institute, Inc.

Westfall, P.H. and R.D. Wolfinger (1997). "Multiple Tests with Discrete Distributions." The American Statistician, 51, 3-8.

Westfall, P.H. and S.S. Young (1993). Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. New York: John Wiley & Sons.

Wilks, S. (1938). "Weighting Systems for Linear Functions of Correlated Variables When There Is No Dependent Variable." Psychometrika, 3, 23-40.
