UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts

Cited by: 141
Authors
Diaz-Papkovich, Alex [1, 2, 3]
Anderson-Trocme, Luke [2, 3, 4]
Ben-Eghan, Chief [2, 3, 4]
Gravel, Simon [2, 3, 4]
Affiliations
[1] McGill University, Quantitative Life Sciences, Montreal, Quebec, Canada
[2] McGill University, Montreal, Quebec, Canada
[3] Genome Quebec Innovation Centre, Montreal, Quebec, Canada
[4] McGill University, Department of Human Genetics, Montreal, Quebec, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
ANCESTRY
DOI
10.1371/journal.pgen.1008432
Chinese Library Classification (CLC) number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Human populations feature both discrete and continuous patterns of variation. Current analysis approaches struggle to jointly identify these patterns because of modelling assumptions, mathematical constraints, or numerical challenges. Here we apply uniform manifold approximation and projection (UMAP), a non-linear dimension reduction tool, to three well-studied genotype datasets and discover overlooked subpopulations within the American Hispanic population, fine-scale relationships between geography, genotypes, and phenotypes in the UK population, and cryptic structure in the Thousand Genomes Project data. This approach is well-suited to the influx of large and diverse data and opens new lines of inquiry in population-scale datasets.

Author summary
The demographic history of human populations features varying geographic and social barriers to mating. Over time, these barriers have led to varying levels of genetic relatedness among individuals. This population structure is informative about human history, and can have a significant impact on studies of medical genetics. Because population structure depends on myriad demographic, ecological, and social forces, a priori visualization is useful to identify subtle patterns of population structure. We use a dimension reduction method, UMAP, to visualize population structure in three genomic datasets and find previously unobserved patterns, revealing fine-scale population structure and illustrating differences between groups in traits such as white blood cell count, height, and FEV1, a measure of lung function. Using UMAP is computationally efficient and can identify fine-scale population structure in large population datasets. We find it particularly useful to reveal phenotypic variation among genetically related populations, and recommend it as a complement to principal component analysis in primary data visualization.
Pages: 24