The Deluge of Spurious Correlations in Big Data

Cited by: 4
Authors
Cristian S. Calude
Giuseppe Longo
Affiliations
[1] University of Auckland, Department of Computer Science
[2] Collège de France & École Normale Supérieure, Centre Cavaillès (République des Savoirs), CNRS
[3] Tufts University School of Medicine, Department of Integrative Physiology and Pathobiology
Source
Foundations of Science | 2017 / Vol. 22
Keywords
Big data; Ergodic theory; Ramsey theory; Algorithmic information theory; Correlation
DOI
Not available
Abstract
Very large databases are a major opportunity for science and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a “philosophy” against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there will be no need to give scientific meaning to phenomena, by proposing, say, causal relations, since regularities in very large databases are enough: “with enough data, the numbers speak for themselves”. The “end of science” is proclaimed. Using classical results from ergodic theory, Ramsey theory and algorithmic information theory, we show that this “philosophy” is wrong. For example, we prove that very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. They can be found in “randomly” generated, large enough databases, which—as we will prove—implies that most correlations are spurious. Too much information tends to behave like very little information. The scientific method can be enriched by computer mining in immense databases, but not replaced by it.
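The claim that regularities can arise from size alone can be illustrated with a small exhaustive check. The sketch below assumes one classical Ramsey-type result, van der Waerden's theorem for two colours and three-term arithmetic progressions; the abstract does not say which specific theorems the paper relies on, so this choice is only illustrative. The function names `has_mono_ap3` and `every_coloring_has_pattern` are hypothetical and not taken from the paper.

```python
from itertools import product


def has_mono_ap3(coloring):
    """Return True if three equally spaced positions share the same colour."""
    n = len(coloring)
    for start in range(n):
        # Largest step such that start + 2 * step is still a valid index.
        max_step = (n - 1 - start) // 2
        for step in range(1, max_step + 1):
            if coloring[start] == coloring[start + step] == coloring[start + 2 * step]:
                return True
    return False


def every_coloring_has_pattern(n, colors=2):
    """Check all colourings of n positions (colors**n of them) exhaustively."""
    return all(has_mono_ap3(c) for c in product(range(colors), repeat=n))


if __name__ == "__main__":
    # For small n some colouring avoids the pattern; from some size on, none does.
    for n in range(3, 10):
        print(n, every_coloring_has_pattern(n))
```

With eight positions the colouring (0, 0, 1, 1, 0, 0, 1, 1) avoids the pattern, while with nine positions no colouring does. The "data" here carry no meaning at all, yet the regularity is unavoidable once the sequence is long enough, which is the sense in which a correlation can be due to size rather than to the nature of the data.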
Pages: 595-612
Page count: 17