Empirical Bayes PCA in high dimensions

Cited by: 8
Authors
Zhong, Xinyi [1]
Su, Chang [2]
Fan, Zhou [1]
Affiliations
[1] Yale Univ, Dept Stat & Data Sci, New Haven, CT 06520 USA
[2] Yale Univ, Dept Biostat, New Haven, CT USA
Keywords
Principal components analysis; empirical Bayes; random matrix theory; AMP algorithms; MAXIMUM-LIKELIHOOD-ESTIMATION; PRINCIPAL COMPONENT ANALYSIS; MESSAGE-PASSING ALGORITHMS; LOW-RANK MATRIX; SPARSE PCA; MIXTURE LIKELIHOODS; COVARIANCE MATRICES; LARGEST EIGENVALUE; CONSISTENCY; PHASE;
DOI
10.1111/rssb.12490
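Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract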
When the dimension of data is comparable to or larger than the number of data samples, principal components analysis (PCA) may exhibit problematic high-dimensional noise. In this work, we propose an empirical Bayes PCA method that reduces this noise by estimating a joint prior distribution for the principal components. EB-PCA is based on the classical Kiefer-Wolfowitz non-parametric maximum likelihood estimator for empirical Bayes estimation, distributional results derived from random matrix theory for the sample PCs and iterative refinement using an approximate message passing (AMP) algorithm. In theoretical 'spiked' models, EB-PCA achieves Bayes-optimal estimation accuracy in the same settings as an oracle Bayes AMP procedure that knows the true priors. Empirically, EB-PCA significantly improves over PCA when there is strong prior structure, both in simulation and on quantitative benchmarks constructed from the 1000 Genomes Project and the International HapMap Project. An illustration is presented for analysis of gene expression data obtained by single-cell RNA-seq.
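The empirical Bayes step named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a scalar model y_i = mu * theta_i + sigma * z_i with known mu and sigma (standing in for the random-matrix description of sample PC entries), approximates the Kiefer-Wolfowitz non-parametric maximum likelihood estimator by EM over weights on a fixed grid, and omits the AMP refinement entirely. The function names `npmle_em` and `posterior_mean`, the grid size, and the simulated two-point prior are all hypothetical choices made for illustration.

```python
import numpy as np


def _likelihood(y, mu, sigma, grid):
    """Gaussian likelihood of each y_i at each grid atom, rescaled per row.

    Row rescaling (subtracting the row maximum in log space) avoids underflow
    and does not change the posterior weights computed from it.
    """
    log_lik = -0.5 * ((y[:, None] - mu * grid[None, :]) / sigma) ** 2
    log_lik -= log_lik.max(axis=1, keepdims=True)
    return np.exp(log_lik)


def npmle_em(y, mu, sigma, grid, n_iter=200):
    """Estimate prior weights on a fixed grid by EM (grid-based NPMLE sketch)."""
    lik = _likelihood(y, mu, sigma, grid)
    w = np.full(grid.size, 1.0 / grid.size)  # start from a uniform prior
    for _ in range(n_iter):
        post = lik * w                        # unnormalized posterior over atoms
        post /= post.sum(axis=1, keepdims=True)
        w = post.mean(axis=0)                 # EM update of the mixing weights
    return w


def posterior_mean(y, mu, sigma, grid, w):
    """Posterior mean of theta_i given y_i under the estimated discrete prior."""
    post = _likelihood(y, mu, sigma, grid) * w
    post /= post.sum(axis=1, keepdims=True)
    return post @ grid


# Simulated example with strong prior structure (assumed two-point prior).
rng = np.random.default_rng(0)
n, mu, sigma = 2000, 1.0, 0.5
theta = rng.choice([-1.0, 1.0], size=n)
y = mu * theta + sigma * rng.standard_normal(n)

grid = np.linspace(y.min() / mu, y.max() / mu, 101)  # atoms span the data range
w = npmle_em(y, mu, sigma, grid)
theta_hat = posterior_mean(y, mu, sigma, grid, w)

mse_raw = np.mean((y / mu - theta) ** 2)     # rescaled noisy observation
mse_eb = np.mean((theta_hat - theta) ** 2)   # empirical Bayes denoised estimate
print(f"raw MSE: {mse_raw:.3f}, empirical Bayes MSE: {mse_eb:.3f}")
```

In this toy setting the prior is estimated from the same noisy observations that are then denoised, which is the sense in which the procedure is "empirical" Bayes; in the method described above, the analogous quantities for mu and sigma would come from the random matrix theory results rather than being known in advance.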
Pages: 853-878
Number of pages: 26