Application of t-SNE to human genetic data

Cited by: 136
Authors
Li, Wentian [1 ]
Cerise, Jane E. [1 ]
Yang, Yaning [2 ]
Han, Henry [3 ]
Affiliations
[1] Feinstein Inst Med Res, Robert S Boas Ctr Genom & Human Genet, Northwell Hlth, Manhasset, NY 11030 USA
[2] Univ Sci & Technol China, Dept Stat & Finance, Hefei, Anhui, Peoples R China
[3] Fordham Univ, Lincoln Ctr, Dept Comp & Informat Sci, New York, NY USA
Keywords
t-SNE; PCA; SNP; dimension reduction; PRINCIPAL-COMPONENT ANALYSIS; FAMILY-BASED TESTS; POPULATION-STRUCTURE; ASSOCIATION; ALGORITHM; MODEL; STRATIFICATION; INFERENCE; CORRECTS; RISK
DOI
10.1142/S0219720017500172
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
The t-distributed stochastic neighbor embedding (t-SNE) is a new dimension reduction and visualization technique for high-dimensional data. t-SNE is rarely applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make these observations: (i) like previously used dimension reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) t-SNE is more robust than PCA to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
Pages: 14