Challenges of Big Data analysis

Cited by: 775
Authors
Fan, Jianqing [1]
Han, Fang [2]
Liu, Han [1]
Affiliations
[1] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
[2] Johns Hopkins Univ, Dept Biostat, Baltimore, MD 21205 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Big Data; noise accumulation; spurious correlation; incidental endogeneity; data storage; scalability; FALSE DISCOVERY RATE; NONCONCAVE PENALIZED LIKELIHOOD; VARIABLE SELECTION; THRESHOLDING ALGORITHM; GENE-EXPRESSION; REGRESSION; LASSO; NUMBER; MODELS; REGULARIZATION;
DOI
10.1093/nsr/nwt032
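Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract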
Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper gives an overview of the salient features of Big Data and of how these features drive paradigm change in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogenous assumptions in most statistical methods for Big Data cannot be validated because of incidental endogeneity. When violated, these assumptions can lead to wrong statistical inferences and consequently wrong scientific conclusions.
Pages: 293-314
Page count: 22