Efficient toolkit implementing best practices for principal component analysis of population genetic data

Cited by: 80
Authors
Prive, Florian [1, 2]
Luu, Keurcien [2]
Blum, Michael G. B. [2, 3]
McGrath, John J. [1, 4, 5]
Vilhjalmsson, Bjarni J. [1]
Affiliations
[1] Aarhus Univ, Natl Ctr Register Based Res, DK-8210 Aarhus, Denmark
[2] Univ Grenoble Alpes, Lab TIMC IMAG, UMR 5525, F-38700 La Tronche, France
[3] OWKIN France, F-75010 Paris, France
[4] Univ Queensland, Queensland Brain Inst, St Lucia, Qld 4072, Australia
[5] Queensland Ctr Mental Hlth Res, Pk Ctr Mental Hlth, Wacol, Qld 4076, Australia
Funding
National Research Foundation, Singapore; UK Medical Research Council;
Keywords
GENOME SCANS; STRATIFICATION; SELECTION; COMMON; SNPS;
DOI
10.1093/bioinformatics/btaa520
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.
Results: For example, we find that PC19-PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest to anyone using PCA in their analyses of genetic data, as well as for other omics data.
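The shrinkage bias of projected PCs mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the bigsnpr implementation and does not use genotype data: it assumes two hypothetical populations described by Gaussian features (many more features than samples), computes the first PC of a training set by power iteration, and compares the PC1 scores of the training samples with those of new samples projected onto the same axis. All function names and parameter values are illustrative assumptions.

```python
import math
import random


def simulate(n_per_pop, n_features, shift, rng):
    """Two toy populations whose feature means are +shift and -shift."""
    rows = []
    for sign in (1, -1):
        for _ in range(n_per_pop):
            rows.append([sign * shift + rng.gauss(0.0, 1.0)
                         for _ in range(n_features)])
    return rows


def column_means(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def center(rows, means):
    return [[x - m for x, m in zip(row, means)] for row in rows]


def project(rows, axis):
    """Scores of each row on a unit-length axis."""
    return [sum(x * w for x, w in zip(row, axis)) for row in rows]


def first_pc(rows, n_iter=40):
    """Leading principal axis of centered rows, found by power iteration."""
    p = len(rows[0])
    axis = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        scores = project(rows, axis)
        # Multiply by X^T: new axis is the score-weighted sum of the rows.
        axis = [sum(s * row[j] for s, row in zip(scores, rows))
                for j in range(p)]
        norm = math.sqrt(sum(w * w for w in axis))
        axis = [w / norm for w in axis]
    return axis


def shrinkage_ratio(seed=1, n_per_pop=15, n_features=600, shift=0.2):
    """Mean |PC1 score| of projected new samples divided by that of
    the training samples; values below 1 indicate shrinkage."""
    rng = random.Random(seed)
    train = simulate(n_per_pop, n_features, shift, rng)
    new = simulate(n_per_pop, n_features, shift, rng)
    means = column_means(train)  # new samples are centered with training means
    train_c = center(train, means)
    axis = first_pc(train_c)
    train_scores = project(train_c, axis)
    new_scores = project(center(new, means), axis)

    def mean_abs(scores):
        return sum(abs(s) for s in scores) / len(scores)

    return mean_abs(new_scores) / mean_abs(train_scores)


if __name__ == "__main__":
    print("projected / training PC1 spread:", round(shrinkage_ratio(), 3))
```

In this toy setting the projected samples come from the same two populations as the training samples, yet their PC1 scores sit closer to zero, because the axis is partly fitted to the noise of the training set. A naive comparison of projected scores with reference scores would therefore be misleading, which is the problem the abstract says bigsnpr corrects.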
Pages: 4449-4457
Page count: 9