Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets

被引:6
作者
Ko, Seyoon [1 ,2 ]
Chu, Benjamin B. [1 ,3 ]
Peterson, Daniel [4 ]
Okenwa, Chidera [5 ]
Papp, Jeanette C. [6 ]
Alexander, David H. [7 ]
Sobel, Eric M. [1 ,6 ]
Zhou, Hua [1 ,2 ]
Lange, Kenneth L. [1 ,6 ,8 ]
机构
[1] Univ Calif Los Angeles, Dept Computat Med, Los Angeles, CA 90095 USA
[2] Univ Calif Los Angeles, Dept Biostat, Los Angeles, CA 90095 USA
[3] Stanford Univ, Dept Biomed Data Sci, Stanford, CA 94305 USA
[4] Brigham Young Univ, Dept Math, Provo, UT 84602 USA
[5] Univ Calif Berkeley, Dept Math, Berkeley, CA 94720 USA
[6] Univ Calif Los Angeles, Dept Human Genet, Los Angeles, CA 90095 USA
[7] X Dev LLC, Mountain View, CA 94043 USA
[8] Univ Calif Los Angeles, Dept Stat, Los Angeles, CA 90095 USA
基金
美国国家科学基金会; 新加坡国家研究基金会;
关键词
POPULATION-STRUCTURE; INFERENCE; ASSOCIATION; CLUSTERS; MODELS; JULIA; SET;
D O I
10.1016/j.ajhg.2022.12.008
中图分类号
Q3 [遗传学];
学科分类号
071007 ; 090102 ;
摘要
Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the 105 to 106 samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken and the egg scenario. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and available for free.
引用
收藏
页码:313 / 325
页数:13
相关论文
共 50 条
[1]   FlashPCA2: principal component analysis of Biobank-scale genotype datasets [J].
Abraham, Gad ;
Qiu, Yixuan ;
Inouye, Michael .
BIOINFORMATICS, 2017, 33 (17) :2776-2778
[2]   Fast model-based estimation of ancestry in unrelated individuals [J].
Alexander, David H. ;
Novembre, John ;
Lange, Kenneth .
GENOME RESEARCH, 2009, 19 (09) :1655-1664
[3]   A global reference for human genetic variation [J].
Altshuler, David M. ;
Durbin, Richard M. ;
Abecasis, Goncalo R. ;
Bentley, David R. ;
Chakravarti, Aravinda ;
Clark, Andrew G. ;
Donnelly, Peter ;
Eichler, Evan E. ;
Flicek, Paul ;
Gabriel, Stacey B. ;
Gibbs, Richard A. ;
Green, Eric D. ;
Hurles, Matthew E. ;
Knoppers, Bartha M. ;
Korbel, Jan O. ;
Lander, Eric S. ;
Lee, Charles ;
Lehrach, Hans ;
Mardis, Elaine R. ;
Marth, Gabor T. ;
McVean, Gil A. ;
Nickerson, Deborah A. ;
Wang, Jun ;
Wilson, Richard K. ;
Boerwinkle, Eric ;
Doddapaneni, Harsha ;
Han, Yi ;
Korchina, Viktoriya ;
Kovar, Christie ;
Lee, Sandra ;
Muzny, Donna ;
Reid, Jeffrey G. ;
Zhu, Yiming ;
Chang, Yuqi ;
Feng, Qiang ;
Fang, Xiaodong ;
Guo, Xiaosen ;
Jian, Min ;
Jiang, Hui ;
Jin, Xin ;
Lan, Tianming ;
Li, Guoqing ;
Li, Jingxiang ;
Li, Yingrui ;
Liu, Shengmao ;
Liu, Xiao ;
Lu, Yao ;
Ma, Xuedi ;
Tang, Meifang ;
Wang, Bo .
NATURE, 2015, 526 (7571) :68-+
[4]   An integrated map of genetic variation from 1,092 human genomes [J].
Altshuler, David M. ;
Durbin, Richard M. ;
Abecasis, Goncalo R. ;
Bentley, David R. ;
Chakravarti, Aravinda ;
Clark, Andrew G. ;
Donnelly, Peter ;
Eichler, Evan E. ;
Flicek, Paul ;
Gabriel, Stacey B. ;
Gibbs, Richard A. ;
Green, Eric D. ;
Hurles, Matthew E. ;
Knoppers, Bartha M. ;
Korbel, Jan O. ;
Lander, Eric S. ;
Lee, Charles ;
Lehrach, Hans ;
Mardis, Elaine R. ;
Marth, Gabor T. ;
McVean, Gil A. ;
Nickerson, Deborah A. ;
Schmidt, Jeanette P. ;
Sherry, Stephen T. ;
Wang, Jun ;
Wilson, Richard K. ;
Gibbs, Richard A. ;
Dinh, Huyen ;
Kovar, Christie ;
Lee, Sandra ;
Lewis, Lora ;
Muzny, Donna ;
Reid, Jeff ;
Wang, Min ;
Wang, Jun ;
Fang, Xiaodong ;
Guo, Xiaosen ;
Jian, Min ;
Jiang, Hui ;
Jin, Xin ;
Li, Guoqing ;
Li, Jingxiang ;
Li, Yingrui ;
Li, Zhuo ;
Liu, Xiao ;
Lu, Yao ;
Ma, Xuedi ;
Su, Zhe ;
Tai, Shuaishuai ;
Tang, Meifang .
NATURE, 2012, 491 (7422) :56-65
[5]  
Arthur D, 2007, PROCEEDINGS OF THE EIGHTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P1027
[6]   A METHOD FOR QUANTIFYING DIFFERENTIATION BETWEEN POPULATIONS AT MULTI-ALLELIC LOCI AND ITS IMPLICATIONS FOR INVESTIGATING IDENTITY AND PATERNITY [J].
BALDING, DJ ;
NICHOLS, RA .
GENETICA, 1995, 96 (1-2) :3-12
[7]   pong: fast analysis and visualization of latent clusters in population genetic data [J].
Behr, Aaron A. ;
Liu, Katherine Z. ;
Liu-Fang, Gracie ;
Nakka, Priyanka ;
Ramachandran, Sohini .
BIOINFORMATICS, 2016, 32 (18) :2817-2823
[8]   Effective Extensible Programming: Unleashing Julia on GPUs [J].
Besard, Tim ;
Foket, Christophe ;
De Sutter, Bjorn .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2019, 30 (04) :827-841
[9]   Julia: A Fresh Approach to Numerical Computing [J].
Bezanson, Jeff ;
Edelman, Alan ;
Karpinski, Stefan ;
Shah, Viral B. .
SIAM REVIEW, 2017, 59 (01) :65-98
[10]   Enhanced Methods for Local Ancestry Assignment in Sequenced Admixed Individuals [J].
Brown, Robert ;
Pasaniuc, Bogdan .
PLOS COMPUTATIONAL BIOLOGY, 2014, 10 (04)