Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model

Cited by: 5
Authors
Dey, Rounak [1]
Lee, Seunggeun [1]
Affiliations
[1] Univ Michigan, Sch Publ Hlth, Dept Biostat, 1415 Washington Hts, Ann Arbor, MI 48109 USA
Funding
National Institutes of Health (NIH), USA
Keywords
Consistent estimation; High-dimensional data; PC scores; Random matrix; LIMITING SPECTRAL DISTRIBUTION; COVARIANCE MATRICES; SAMPLE EIGENVALUES; EIGENVECTORS; CONVERGENCE; SCORES;
DOI
10.1016/j.jmva.2019.02.007
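Chinese Library Classification (CLC)
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract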
With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigate the asymptotic behavior of PCA under the generalized spiked population model. Based on our theoretical results, we propose a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we show that our methods can greatly reduce bias and improve prediction accuracy. (C) 2019 Elsevier Inc. All rights reserved.
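The eigenvalue bias that motivates the abstract can be seen in a small simulation. The sketch below is illustrative only and is not the authors' method: it uses the classical spiked model (one large population eigenvalue, all others equal to 1) rather than the generalized model studied in the paper. It compares the top sample eigenvalue with the known classical limit ℓ(1 + γ/(ℓ − 1)) for a spike ℓ > 1 + √γ, where γ = p/n. The function names `spiked_limit` and `simulate_top_eigenvalue` and all parameter values are assumptions made for this example.

```python
import numpy as np


def spiked_limit(spike: float, gamma: float) -> float:
    """Limit of the top sample eigenvalue in the classical spiked model.

    Assumes a single population spike above the threshold 1 + sqrt(gamma)
    and unit non-spiked eigenvalues; gamma is the ratio p / n.
    """
    if spike <= 1.0 + np.sqrt(gamma):
        raise ValueError("spike must exceed 1 + sqrt(gamma)")
    return spike * (1.0 + gamma / (spike - 1.0))


def simulate_top_eigenvalue(n: int, p: int, spike: float, seed: int = 0) -> float:
    """Largest eigenvalue of the sample covariance of n Gaussian
    observations in p dimensions with population covariance
    diag(spike, 1, ..., 1). The mean is known to be zero, so the
    data are not centered."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))
    x[:, 0] *= np.sqrt(spike)          # inflate variance of the first feature
    sample_cov = x.T @ x / n           # p x p sample covariance matrix
    return float(np.linalg.eigvalsh(sample_cov)[-1])  # eigvalsh sorts ascending


if __name__ == "__main__":
    n, p, spike = 2000, 2000, 5.0      # hypothetical sizes, gamma = 1
    top = simulate_top_eigenvalue(n, p, spike)
    print("population spike:      ", spike)
    print("top sample eigenvalue: ", top)
    print("classical limit:       ", spiked_limit(spike, p / n))
```

In this setting the top sample eigenvalue overshoots the population spike and settles near the classical limit, which is the kind of bias that consistent eigenvalue estimators are designed to correct.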
Pages: 145-164
Page count: 20
Related papers
23 in total
[1]   Data quality control in genetic case-control association studies [J].
Anderson, Carl A. ;
Pettersson, Fredrik H. ;
Clarke, Geraldine M. ;
Cardon, Lon R. ;
Morris, Andrew P. ;
Zondervan, Krina T. .
NATURE PROTOCOLS, 2010, 5 (09) :1564-1573
[2]  
Bai ZD, 1998, ANN PROBAB, V26, P316
[3]   On sample eigenvalues in a generalized spiked population model [J].
Bai, Zhidong ;
Yao, Jianfeng .
JOURNAL OF MULTIVARIATE ANALYSIS, 2012, 106 :167-177
[4]   Eigenvalues of large sample covariance matrices of spiked population models [J].
Baik, Jinho ;
Silverstein, Jack W. .
JOURNAL OF MULTIVARIATE ANALYSIS, 2006, 97 (06) :1382-1408
[5]  
Berkelaar M., 2015, IPSOLVE INTERFACE LP
[6]  
Boyd Stephen P., 2014, Convex Optimization
[7]  
Cai T., 2017, LIMITING LAWS DIVERG
[8]   Convergence of Sample Eigenvectors of Spiked Population Model [J].
Ding, Xue .
COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2015, 44 (18) :3825-3840
[9]   SPECTRUM ESTIMATION FOR LARGE DIMENSIONAL COVARIANCE MATRICES USING RANDOM MATRIX THEORY [J].
El Karoui, Noureddine .
ANNALS OF STATISTICS, 2008, 36 (06) :2757-2790
[10]  
Girko V.L., 1996, Random Operators and Stochastic Equations, V4, P176, DOI 10.1515/rose.1996.4.2.179