Efficient toolkit implementing best practices for principal component analysis of population genetic data

Cited by: 80
Authors
Prive, Florian [1, 2]
Luu, Keurcien [2]
Blum, Michael G. B. [2, 3]
McGrath, John J. [1, 4, 5]
Vilhjalmsson, Bjarni J. [1]
Affiliations
[1] Aarhus Univ, Natl Ctr Register Based Res, DK-8210 Aarhus, Denmark
[2] Univ Grenoble Alpes, Lab TIMC IMAG, UMR 5525, F-38700 La Tronche, France
[3] OWKIN France, F-75010 Paris, France
[4] Univ Queensland, Queensland Brain Inst, St Lucia, Qld 4072, Australia
[5] Queensland Ctr Mental Hlth Res, Pk Ctr Mental Hlth, Wacol, Qld 4076, Australia
Funding
National Research Foundation, Singapore; UK Medical Research Council
Keywords
GENOME SCANS; STRATIFICATION; SELECTION; COMMON; SNPS;
DOI
10.1093/bioinformatics/btaa520
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and to control for population structure in various genetic analyses. However, conducting PCA can be complicated and has several potential pitfalls: (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) shrinkage bias in projected PCs, (iii) the need to detect sample outliers and (iv) uneven population sizes. In this work, we explore these issues and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes Project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in the R packages bigsnpr and bigutilsr.
Results: For example, we find that PC19-PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. We therefore recommend using only 16-18 PCs from the UK Biobank to account for confounding due to population structure. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes Project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest to anyone using PCA in analyses of genetic data, as well as of other omics data.
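The shrinkage bias of projected PCs mentioned in the abstract can be reproduced on simulated data. The sketch below is illustrative only: it uses plain NumPy rather than bigsnpr, and the sample sizes, number of SNPs and allele-frequency shift are arbitrary assumptions, not values from the paper. It computes PCA on a training set of two simulated populations, projects new individuals from the same populations with a naive matrix product, and compares the magnitude of the PC1 scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 5000  # assumed sizes; many more SNPs than samples

# Two populations whose allele frequencies differ slightly at every SNP.
base = rng.uniform(0.1, 0.9, n_snps)
shift = 0.03 * rng.choice([-1.0, 1.0], n_snps)
freqs = np.clip(np.vstack([base + shift, base - shift]), 0.01, 0.99)


def simulate(n):
    """Draw n individuals per population as 0/1/2 genotype counts."""
    blocks = [rng.binomial(2, f, size=(n, n_snps)) for f in freqs]
    return np.vstack(blocks).astype(float)


train = simulate(n_per_pop)  # individuals used to compute the PCA
test = simulate(n_per_pop)   # new individuals to be projected

# Standardize both sets with the training means and standard deviations.
mu, sd = train.mean(axis=0), train.std(axis=0)
keep = sd > 0  # drop SNPs that are monomorphic in the training set
z_train = (train[:, keep] - mu[keep]) / sd[keep]
z_test = (test[:, keep] - mu[keep]) / sd[keep]

# PCA of the training set via singular value decomposition.
u, s, vt = np.linalg.svd(z_train, full_matrices=False)
train_pc1 = u[:, 0] * s[0]   # PC1 scores of the training individuals
test_pc1 = z_test @ vt[0]    # naive projection of the new individuals

# A ratio below 1 means the projected scores are shrunk towards zero.
ratio = np.abs(test_pc1).mean() / np.abs(train_pc1).mean()
print(f"mean |PC1| projected / training: {ratio:.2f}")
```

Although both sets are drawn from the same two populations, the projected scores come out smaller in magnitude than the training scores, because the training scores also absorb sampling noise when the number of SNPs far exceeds the number of individuals. This naive projection is the biased baseline; the correction implemented in bigsnpr is not reproduced here.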
Pages: 4449-4457
Number of pages: 9