Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

Cited by: 8
Authors
Yun, Taedong [1]
Cosentino, Justin [2]
Behsaz, Babak [1]
McCaw, Zachary R. [2, 13]
Hill, Davin [3, 4]
Luben, Robert [5, 6, 7]
Lai, Dongbing [8]
Bates, John [9]
Yang, Howard [2]
Schwantes-An, Tae-Hwi [8, 10]
Zhou, Yuchen [1]
Khawaja, Anthony P. [5, 6, 7]
Carroll, Andrew [2]
Hobbs, Brian D. [4, 11, 12]
Cho, Michael H. [4, 11, 12]
McLean, Cory Y. [1]
Hormozdiari, Farhad [1]
Affiliations
[1] Google Res, Cambridge, MA 02142 USA
[2] Google Res, Mountain View, CA USA
[3] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA USA
[4] Brigham & Womens Hosp, Channing Div Network Med, Boston, MA USA
[5] Moorfields Eye Hosp, NIHR Biomed Res Ctr, London, England
[6] Univ Coll London UCL, Inst Ophthalmol, London, England
[7] Univ Cambridge, MRC Epidemiol Unit, Cambridge, England
[8] Indiana Univ Sch Med, Dept Med & Mol Genet, Indianapolis, IN USA
[9] Verily Life Sci, South San Francisco, CA USA
[10] Indiana Univ Sch Med, Dept Med, Div Cardiol, Indianapolis, IN USA
[11] Brigham & Womens Hosp, Div Pulm & Crit Care Med, Boston, MA USA
[12] Harvard Med Sch, Boston, MA USA
[13] Insitro, South San Francisco, CA USA
Funding
UK Research and Innovation; UK Medical Research Council; US National Institutes of Health
Keywords
OBSTRUCTIVE PULMONARY-DISEASE; WIDE ASSOCIATION; CORRELATED PHENOTYPES; RISK; COPD; PHOTOPLETHYSMOGRAPHY; INSIGHTS; POWER; SET;
DOI
10.1038/s41588-024-01831-6
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very little labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD: spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE embeddings are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE embeddings contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.
REGLE uses machine learning to generate low-dimensional representations of healthcare data. Applied to lung spirograms and blood volume photoplethysmograms, REGLE factors capture additional information beyond expert-defined features, suggesting the utility of this approach.
Pages: 1604-1613
Page count: 27