Faster model-based estimation of ancestry proportions

Cited by: 0

Authors:

Santander, Cindy G. ^{[1]}

Martinez, Alba Refoyo ^{[2]}

Meisner, Jonas ^{[3,4]}

Affiliations:

[1] Univ Copenhagen, Dept Biol, Copenhagen, Denmark

[2] Univ Copenhagen, Ctr Hlth Data Sci, Copenhagen, Denmark

[3] Copenhagen Univ Hosp, Mental Hlth Ctr Copenhagen, Copenhagen, Denmark

[4] Univ Copenhagen, Novo Nordisk Fdn, Ctr Basic Metab Res, Copenhagen, Denmark

Source:

PEER COMMUNITY JOURNAL | 2024 / Volume 4

Keywords:

POPULATION-STRUCTURE; ADMIXTURE

DOI:

10.24072/pcjournal.503

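Chinese Library Classification:

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]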

Discipline classification codes:

07; 0710; 09

Abstract:

Ancestry estimation from genotype data in unrelated individuals has become an essential tool in population and medical genetics to understand demographic population histories and to model or correct for population structure. The ADMIXTURE software is a widely used model-based approach to account for population stratification; however, it struggles with convergence issues and does not scale to modern human datasets or the large number of variants in whole-genome sequencing data. Likelihood-free approaches optimize a least-squares objective and have gained popularity in recent years due to their scalability. However, this scalability comes at the cost of accuracy in the ancestry estimates in more complex admixture scenarios. We present a new model-based approach, fastmixture, which adopts aspects from likelihood-free approaches for parameter initialization, followed by a mini-batch expectation-maximization procedure to model the standard likelihood. In a simulation study, we demonstrate that the model-based approaches of fastmixture and ADMIXTURE are significantly more accurate than recent and likelihood-free approaches. We further show that fastmixture runs approximately 30x faster than ADMIXTURE on both simulated and empirical data from the 1000 Genomes Project, such that our model-based approach scales to much larger sample sizes than previously possible.
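The mini-batch expectation-maximization idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the fastmixture implementation. It assumes the standard admixture likelihood, in which a genotype G[i, j] in {0, 1, 2} is binomial with success probability (Q P^T)[i, j], where Q holds ancestry proportions and P holds ancestral allele frequencies. The function names (`log_likelihood`, `em_step`, `minibatch_em`), the batch schedule, and the random initialization are all hypothetical; random initialization stands in here for the likelihood-free initialization the abstract describes.

```python
import numpy as np

def log_likelihood(G, Q, P):
    """Binomial log-likelihood (up to a constant) of genotypes G given Q and P."""
    H = Q @ P.T  # individual-specific allele frequencies, shape (N, M)
    return float(np.sum(G * np.log(H) + (2 - G) * np.log1p(-H)))

def em_step(G, Q, P, eps=1e-5):
    """One EM update of Q (N x K) and P (M x K) using the SNPs in G (N x M)."""
    H = Q @ P.T
    A = G / H              # weight of observed alternate alleles
    B = (2 - G) / (1 - H)  # weight of observed reference alleles
    num = P * (A.T @ Q)                # expected alternate alleles per ancestry
    den = num + (1 - P) * (B.T @ Q)    # expected total alleles per ancestry
    Q_new = Q * (A @ P + B @ (1 - P)) / (2 * G.shape[1])
    # Keep parameters strictly inside (0, 1) so the logarithms stay finite.
    P_new = np.clip(num / den, eps, 1 - eps)
    Q_new = np.clip(Q_new, eps, 1 - eps)
    Q_new /= Q_new.sum(axis=1, keepdims=True)
    return Q_new, P_new

def minibatch_em(G, K, n_batches=4, n_epochs=20, seed=0):
    """Run EM over random SNP batches; Q is refreshed after every batch."""
    rng = np.random.default_rng(seed)
    N, M = G.shape
    # Assumption: random start, not the initialization used by fastmixture.
    Q = rng.dirichlet(np.ones(K), size=N)
    P = rng.uniform(0.1, 0.9, size=(M, K))
    for _ in range(n_epochs):
        for idx in np.array_split(rng.permutation(M), n_batches):
            Q, P[idx] = em_step(G[:, idx], Q, P[idx])
    return Q, P
```

Updating Q after each SNP batch rather than after a full pass gives more frequent, noisier updates per pass over the data. With `n_batches=1` the sketch reduces to ordinary full-data EM.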

Pages: 15
