The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research

Cited by: 185
Authors
Amrhein, Valentin [1 ,2 ,3 ]
Korner-Nievergelt, Franzi [3 ,4 ]
Roth, Tobias [1 ,2 ]
Affiliations
[1] Univ Basel, Zool Inst, Basel, Switzerland
[2] Res Stn Petite Camargue Alsacienne, St Louis, France
[3] Swiss Ornithol Inst, Sempach, Switzerland
[4] Oikostat GmbH, Ettiswil, Switzerland
Source
PEERJ | 2017, Vol. 5
Funding
Swiss National Science Foundation;
Keywords
P-value; Significant; Nonsignificant; Threshold; Publication bias; Truth inflation; Winner's curse; Reproducibility; Replicability; Graded evidence; FILE DRAWER PROBLEM; CONFIDENCE-INTERVALS; STATISTICAL SIGNIFICANCE; CUMULATIVE KNOWLEDGE; BEHAVIORAL ECOLOGY; REVISED STANDARDS; PUBLICATION BIAS; 05; LEVEL; NULL; HYPOTHESIS;
DOI
10.7717/peerj.3544
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07 ; 0710 ; 09 ;
Abstract
The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p < 0.05) is likewise hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of cases if there is a true effect. A replication therefore cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. Yet current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, which would amount to the false conclusion that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should instead be made more stringent, that sample sizes could decrease, or that p-values should be abandoned altogether. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
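The abstract's claim that 80% power yields 'conflicting' results in about one third of cases follows from simple arithmetic: if two independent studies each detect a true effect with probability 0.8, the chance that exactly one of them reaches p < 0.05 is 2 × 0.8 × 0.2 = 0.32. The simulation below is a minimal illustrative sketch, not material from the paper: it assumes a hypothetical two-sample t-test design with a true standardized effect of d = 0.5 and 64 observations per group (roughly 80% power at alpha = 0.05), and it also reports the average effect estimate among significant results to show the upward bias ('winner's curse', 'truth inflation') mentioned in the abstract.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Assumed design for illustration only: true effect d = 0.5, n = 64 per group,
# two-sided two-sample t-test at alpha = 0.05 (roughly 80% power).
n, d_true, alpha, n_sim = 64, 0.5, 0.05, 20_000

def study():
    # One study: return its p-value and its estimated standardized effect size.
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d_true, 1.0, n)
    p = stats.ttest_ind(b, a).pvalue
    d_hat = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return p, d_hat

pairs = np.array([study() + study() for _ in range(n_sim)])  # columns: p1, d1, p2, d2
p1, d1, p2 = pairs[:, 0], pairs[:, 1], pairs[:, 2]
sig1, sig2 = p1 < alpha, p2 < alpha

print(f"empirical power           : {sig1.mean():.2f}")           # ~0.80
print(f"'conflicting' study pairs : {(sig1 != sig2).mean():.2f}")  # ~0.32, about one third
print(f"mean d-hat | significant  : {d1[sig1].mean():.2f} vs true d = {d_true}")  # inflated

Because the condition 'p < 0.05' selectively keeps the larger estimates, the mean significant effect estimate comes out noticeably above the true d = 0.5 even at 80% power, and the inflation grows as power drops.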
Pages: 40