Best (but oft-forgotten) practices: the multiple problems of multiplicity-whether and how to correct for many statistical tests

Cited by: 242
Authors
Streiner, David L. [1, 2]
Affiliations
[1] McMaster Univ, Dept Psychiat & Behav Neurosci, Hamilton, ON, Canada
[2] Univ Toronto, Dept Psychiat, Toronto, ON, Canada
Keywords
multiplicity; significance testing; statistics; Bonferroni; false discovery rate; family wise error rate; TRIALS; POWERFUL; DESIGN
DOI
10.3945/ajcn.115.113548
Chinese Library Classification (CLC) number
R15 [Nutritional hygiene, food hygiene]; TS201 [Basic science]
Discipline classification code
100403
Abstract
Testing many null hypotheses in a single study results in an increased probability of detecting a significant finding just by chance (the problem of multiplicity). Debates have raged over many years with regard to whether to correct for multiplicity and, if so, how it should be done. This article first discusses how multiple tests lead to an inflation of the α level, then explores the following different contexts in which multiplicity arises: testing for baseline differences in various types of studies, having >1 outcome variable, conducting statistical tests that produce >1 P value, taking multiple "peeks" at the data, and unplanned, post hoc analyses (i.e., "data dredging," "fishing expeditions," or "P-hacking"). It then discusses some of the methods that have been proposed for correcting for multiplicity, including single-step procedures (e.g., Bonferroni); multistep procedures, such as those of Holm, Hochberg, and Sidak; false discovery rate control; and resampling approaches. Note that these various approaches describe different aspects and are not necessarily mutually exclusive. For example, resampling methods could be used to control the false discovery rate or the family-wise error rate (as defined later in this article). However, the use of one of these approaches presupposes that we should correct for multiplicity, which is not universally accepted, and the article presents the arguments for and against such "correction." The final section brings together these threads and presents suggestions with regard to when it makes sense to apply the corrections and how to do so.
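The inflation of the α level and three of the corrections named in the abstract can be illustrated with a minimal sketch. The code below is not taken from the article: the function names are hypothetical, the family-wise error rate formula assumes the tests are independent, and the adjusted P values follow the textbook definitions of the Bonferroni, Holm, and Benjamini-Hochberg procedures.

```python
# Illustrative sketch only; function names are hypothetical, not from the article.

def familywise_error_rate(alpha, k):
    """Chance of at least one false positive among k tests.

    Assumes the k tests are independent and every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** k


def bonferroni(pvalues):
    """Single-step correction: multiply each P value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Step-down correction: the i-th smallest P value is multiplied by (m - i + 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending P values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvalues):
    """Step-up adjustment controlling the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, i in enumerate(order):
        rank = m - pos  # 1-based rank of this P value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(round(familywise_error_rate(0.05, 10), 4))
    example = [0.01, 0.04, 0.03, 0.005]  # made-up P values for illustration
    print(bonferroni(example))
    print(holm(example))
    print(benjamini_hochberg(example))
```

With ten independent tests at α = 0.05, the chance of at least one false positive is roughly 40%. On the made-up P values, Bonferroni is the most conservative adjustment, Holm is never more conservative than Bonferroni, and Benjamini-Hochberg is the most lenient because it controls a different quantity (the false discovery rate rather than the family-wise error rate).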
Pages: 721-728
Page count: 8