Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining

Cited by: 15
Authors
Hero, Alfred O., III [1]
Rajaratnam, Bala [2]
Affiliations
[1] Univ Michigan, Ann Arbor, MI 48109 USA
[2] Stanford Univ, Stanford, CA 94305 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Asymptotic regimes; big data; correlation estimation; correlation mining; correlation screening; correlation selection; graphical models; large-scale inference; purely high dimensional; sample complexity; triple asymptotic framework; unifying learning theory; MAXIMUM-LIKELIHOOD; COVARIANCE ESTIMATION; SAMPLE COMPLEXITY; CFAR DETECTION; SPARSITY RECOVERY; MODEL SELECTION; BIG DATA; MATRIX; CONVERGENCE; CONSISTENCY;
DOI
10.1109/JPROC.2015.2494178
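Chinese Library Classification number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract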
When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the data set is often variable rich but sample starved: a regime where the number n of acquired samples (statistical replicates) is far smaller than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity, however, has received comparatively little attention, especially in the setting where the sample size n is fixed and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the last regime applies to exascale data dimensions. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
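The sample-starved regime described in the abstract (n fixed, p growing) can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes independent standard Gaussian variables, so every true correlation is zero, and the function names `max_abs_offdiag_corr` and `spurious_corr_curve` are hypothetical. It reports the largest absolute sample correlation found among the first p variables of one fixed data set with n samples.

```python
import numpy as np


def max_abs_offdiag_corr(X):
    """Largest |sample correlation| over distinct pairs of columns of X."""
    R = np.corrcoef(X, rowvar=False)      # p x p sample correlation matrix
    iu = np.triu_indices_from(R, k=1)     # indices strictly above the diagonal
    return float(np.max(np.abs(R[iu])))


def spurious_corr_curve(n=10, dims=(10, 100, 1000), seed=0):
    """Max spurious correlation among the first p variables, for each p in dims.

    Assumption (not from the source): variables are i.i.d. standard normal,
    so any nonzero sample correlation is purely a finite-sample artifact.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, max(dims)))   # n samples, max(dims) variables
    # Nested subsets of columns: each larger p contains the smaller ones.
    return {p: max_abs_offdiag_corr(X[:, :p]) for p in dims}


if __name__ == "__main__":
    for p, r in spurious_corr_curve().items():
        print(f"n=10, p={p:5d}: max |sample correlation| = {r:.3f}")
```

Because the variable sets are nested, the reported maximum cannot decrease as p grows. With n held fixed, it approaches 1 even though no variable pair is truly correlated, which is why a screening threshold has to account for p as well as n.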
Pages: 93-110
Number of pages: 18