Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

Cited by: 8
Authors
Yun, Taedong [1]
Cosentino, Justin [2]
Behsaz, Babak [1]
McCaw, Zachary R. [2,13]
Hill, Davin [3,4]
Luben, Robert [5,6,7]
Lai, Dongbing [8]
Bates, John [9]
Yang, Howard [2]
Schwantes-An, Tae-Hwi [8,10]
Zhou, Yuchen [1]
Khawaja, Anthony P. [5,6,7]
Carroll, Andrew [2]
Hobbs, Brian D. [4,11,12]
Cho, Michael H. [4,11,12]
McLean, Cory Y. [1]
Hormozdiari, Farhad [1]
Affiliations
[1] Google Res, Cambridge, MA 02142 USA
[2] Google Res, Mountain View, CA USA
[3] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA USA
[4] Brigham & Womens Hosp, Channing Div Network Med, Boston, MA USA
[5] Moorfields Eye Hosp, NIHR Biomed Res Ctr, London, England
[6] Univ Coll London UCL, Inst Ophthalmol, London, England
[7] Univ Cambridge, MRC Epidemiol Unit, Cambridge, England
[8] Indiana Univ Sch Med, Dept Med & Mol Genet, Indianapolis, IN USA
[9] Verily Life Sci, South San Francisco, CA USA
[10] Indiana Univ Sch Med, Dept Med, Div Cardiol, Indianapolis, IN USA
[11] Brigham & Womens Hosp, Div Pulm & Crit Care Med, Boston, MA USA
[12] Harvard Med Sch, Boston, MA USA
[13] Insitro, South San Francisco, CA USA
Funding
UK Medical Research Council; US National Institutes of Health; UK Research and Innovation;
Keywords
OBSTRUCTIVE PULMONARY-DISEASE; WIDE ASSOCIATION; CORRELATED PHENOTYPES; RISK; COPD; PHOTOPLETHYSMOGRAPHY; INSIGHTS; POWER; SET;
DOI
10.1038/s41588-024-01831-6
Chinese Library Classification number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very few labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD: spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE embeddings are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE embeddings contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.

Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE) uses machine learning to generate low-dimensional representations of healthcare data. Applied to lung spirograms and blood volume photoplethysmograms, REGLE factors capture additional information beyond expert-defined features, suggesting the utility of this approach.
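
The two-stage idea described in the abstract (compress each high-dimensional recording into a few embedding coordinates, then treat each coordinate as a GWAS phenotype) can be illustrated with a minimal sketch. This is not the authors' implementation: PCA is swapped in as a linear stand-in for the variational autoencoder, the data are simulated, p-values use a normal approximation, and every name below (`pca_embed`, `association_scan`, `simulate`, and all parameter values) is a hypothetical chosen for illustration.

```python
import math
import numpy as np


def pca_embed(curves, n_dims):
    """Linear stand-in for the encoder (NOT a VAE): project the centered
    curves onto their top principal directions."""
    centered = curves - curves.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T


def association_scan(genotypes, phenotype):
    """Simple linear regression of the phenotype on each variant in turn.
    Returns per-variant effect sizes, z-scores and two-sided p-values
    (normal approximation; no covariates, an assumption of this sketch)."""
    n = len(phenotype)
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    sxx = (g ** 2).sum(axis=0)
    beta = (g * y[:, None]).sum(axis=0) / sxx
    resid = y[:, None] - g * beta
    se = np.sqrt((resid ** 2).sum(axis=0) / (n - 2) / sxx)
    z = beta / se
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return beta, z, p


def simulate(n=2000, n_points=50, n_snps=20, causal=3, seed=0):
    """Toy data: one hidden factor, shifted by a single causal variant,
    scales the shape of each simulated curve."""
    rng = np.random.default_rng(seed)
    genotypes = rng.binomial(2, 0.3, size=(n, n_snps)).astype(float)
    t = np.linspace(0.0, 1.0, n_points)
    latent = 0.5 * genotypes[:, causal] + rng.normal(size=n)
    noise = 0.1 * rng.normal(size=(n, n_points))
    curves = latent[:, None] * np.sin(np.pi * t)[None, :] + noise
    return genotypes, curves


genotypes, curves = simulate()
embeddings = pca_embed(curves, n_dims=2)                     # stage 1: embed
beta, z, p = association_scan(genotypes, embeddings[:, 0])   # stage 2: scan
top_snp = int(np.argmin(p))
print("most strongly associated variant index:", top_snp)
```

In this toy setting the first embedding coordinate tracks the hidden factor, so the scan points back at the simulated causal variant. A real analysis would replace the PCA step with a trained nonlinear encoder and would adjust for covariates such as age, sex and genetic ancestry.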
Pages: 1604-1613
Number of pages: 27