Misinterpreting p: The Discrepancy Between p Values and the Probability the Null Hypothesis is True, the Influence of Multiple Testing, and Implications for the Replication Crisis

Cited by: 33

Authors:

Anderson, Samantha F. [1]

Affiliations:

[1] Arizona State Univ, Tempe, AZ 85281 USA

Source:

PSYCHOLOGICAL METHODS | 2020 / Vol. 25 / Issue 05

Keywords:

p values; multiple testing; statistical significance; replication; QUESTIONABLE RESEARCH PRACTICES; PSYCHOLOGICAL-RESEARCH; PUBLICATION BIAS; BAYES FACTORS; EFFECT SIZES; SAMPLE-SIZE; CONSEQUENCES; SCIENCE; PREVALENCE; FISHER;

DOI:

10.1037/met0000248

Chinese Library Classification (CLC) number:

B84 [Psychology];

Discipline classification code:

04 ; 0402 ;

Abstract:

The p value is still misinterpreted as the probability that the null hypothesis is true. Even psychologists who correctly understand that p values do not provide this probability may not realize the degree to which p values differ from the probability that the null hypothesis is true. Importantly, previous research on this topic has not addressed the influence of multiple testing, often a reality in psychological studies, and has not extensively considered the influence of different prior probabilities favoring the null and alternative hypotheses. Simulation studies are presented that emphasize the magnitude by which p values are distinct from the posterior probability that the null hypothesis is true, under an extensive set of conditions including multiple testing. Particular emphasis is placed on p values just under .05, given the prevalence of these p values in the published literature, though p values in other intervals are also assessed. In diverse conditions, results indicate that posterior probabilities favoring the null hypothesis are often far removed from .05, and this pattern quickly gets much worse when multiple testing is conducted. Rather than simply telling researchers that p values do not reflect the probability favoring the null hypothesis, as has been done previously, the results presented here allow psychologists to see the evidence provided by various p values. These results have particularly topical implications for the replication crisis, for how much weight should be placed on a single study, and for how the term statistical significance should be interpreted, particularly in conditions typical in psychological research.

Translational Abstract:

Scientific studies often pit two hypotheses against each other: a null hypothesis (typically a claim of no effect) and an alternative hypothesis (which claims the effect of interest exists). Studies often rely heavily on a quantity known as the p value to evaluate the results. The p value is commonly believed to imply the likelihood that the null hypothesis is true: A small p value would imply that it is unlikely that the null hypothesis is true, leading psychologists to find support for the alternative hypothesis instead. However, p values do not, in fact, reveal the likelihood that the null hypothesis is true. This article (a) shows how different the p value is from the corresponding likelihood favoring the null hypothesis under a variety of important conditions; (b) investigates the influence that multiple testing (conducting multiple statistical tests on the same or similar sets of variables) and the overall likelihood that the null hypothesis is true have on these differences; and (c) pays particular attention to p values falling just under .05, the standard threshold for considering a result "statistically significant." Results indicate that p values are often very different from the likelihood that the null hypothesis is true, and multiple testing makes these differences even larger. These results have implications for the replication crisis, for relying too much on single studies, and for how statistical significance should be interpreted.
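The gap between a p value and the posterior probability of the null hypothesis can be illustrated with a small Bayes-rule calculation. The sketch below is not the article's simulation design; it is a minimal illustration under assumptions that do not come from this record: a two-sided z test, a single fixed standardized effect `delta` under the alternative, and a simple stand-in for multiple testing in which `n_tests` independent tests are run and only the smallest p value is reported. The function and parameter names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the z test


def p_cdf_under_h1(alpha, delta):
    """P(p <= alpha) for a two-sided z test when the true shift is delta.

    Assumption: the test statistic is N(delta, 1) under the alternative.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - _STD_NORMAL.cdf(z_crit - delta)
    lower_tail = _STD_NORMAL.cdf(-z_crit - delta)
    return upper_tail + lower_tail


def posterior_null(p_low, p_high, prior_null, delta, n_tests=1):
    """Posterior P(H0 | reported p falls in [p_low, p_high]) via Bayes' rule.

    Assumption: with n_tests > 1, the reported p is the minimum of
    n_tests independent p values that share the same hypothesis state.
    """
    def prob_min_p_in_interval(cdf):
        # P(min p in [p_low, p_high]) = P(min > p_low) - P(min > p_high)
        return (1 - cdf(p_low)) ** n_tests - (1 - cdf(p_high)) ** n_tests

    # Under H0 the p value is uniform on (0, 1), so its CDF is the identity.
    like_h0 = prob_min_p_in_interval(lambda a: a)
    like_h1 = prob_min_p_in_interval(lambda a: p_cdf_under_h1(a, delta))
    weighted_h0 = prior_null * like_h0
    return weighted_h0 / (weighted_h0 + (1 - prior_null) * like_h1)


if __name__ == "__main__":
    # Illustrative settings only: equal priors, delta chosen for high power.
    for k in (1, 5):
        post = posterior_null(0.04, 0.05, prior_null=0.5, delta=2.8, n_tests=k)
        print(f"n_tests={k}: P(H0 | p in [.04, .05]) = {post:.3f}")
```

Under these illustrative settings, the posterior probability of the null for a p value just under .05 is well above .05, and it rises further as `n_tests` grows or as the prior probability of the null increases. The specific numbers depend entirely on the assumed `delta`, prior, and selection rule.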


Pages: 596-609

Page count: 14

82 references in total

[1] Estimating the reproducibility of psychological science [J].