Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

Cited by: 8
Authors
Yun, Taedong [1]
Cosentino, Justin [2]
Behsaz, Babak [1]
McCaw, Zachary R. [2, 13]
Hill, Davin [3, 4]
Luben, Robert [5, 6, 7]
Lai, Dongbing [8]
Bates, John [9]
Yang, Howard [2]
Schwantes-An, Tae-Hwi [8, 10]
Zhou, Yuchen [1]
Khawaja, Anthony P. [5, 6, 7]
Carroll, Andrew [2]
Hobbs, Brian D. [4, 11, 12]
Cho, Michael H. [4, 11, 12]
McLean, Cory Y. [1]
Hormozdiari, Farhad [1]
Affiliations
[1] Google Res, Cambridge, MA 02142 USA
[2] Google Res, Mountain View, CA USA
[3] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA USA
[4] Brigham & Womens Hosp, Channing Div Network Med, Boston, MA USA
[5] Moorfields Eye Hosp, NIHR Biomed Res Ctr, London, England
[6] Univ Coll London UCL, Inst Ophthalmol, London, England
[7] Univ Cambridge, MRC Epidemiol Unit, Cambridge, England
[8] Indiana Univ Sch Med, Dept Med & Mol Genet, Indianapolis, IN USA
[9] Verily Life Sci, South San Francisco, CA USA
[10] Indiana Univ Sch Med, Dept Med, Div Cardiol, Indianapolis, IN USA
[11] Brigham & Womens Hosp, Div Pulm & Crit Care Med, Boston, MA USA
[12] Harvard Med Sch, Boston, MA USA
[13] Insitro, South San Francisco, CA USA
Funding
UK Medical Research Council; US National Institutes of Health; UK Research and Innovation
Keywords
OBSTRUCTIVE PULMONARY-DISEASE; WIDE ASSOCIATION; CORRELATED PHENOTYPES; RISK; COPD; PHOTOPLETHYSMOGRAPHY; INSIGHTS; POWER; SET;
DOI
10.1038/s41588-024-01831-6
Chinese Library Classification number
Q3 [Genetics];
Discipline classification codes
071007 ; 090102 ;
Abstract
Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very little labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD: spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE embeddings are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE embeddings contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.
Editor's summary: Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE) uses machine learning to generate low-dimensional representations of healthcare data. Applied to lung spirograms and blood volume photoplethysmograms, REGLE factors capture additional information beyond expert-defined features, suggesting the utility of this approach.
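The abstract states that the learned embeddings become the inputs to GWAS. As a rough illustration of that second step only, the sketch below treats each embedding coordinate as a quantitative phenotype and regresses it on each variant's allele dosage. It does not reproduce the authors' pipeline: the function name `association_scan`, the synthetic genotypes and embeddings, and the normal-approximation p-value are all assumptions made here, and a real GWAS would also adjust for covariates such as age, sex, and genetic principal components.

```python
import math
import numpy as np


def association_scan(genotypes, embeddings):
    """Simple linear regression of every embedding coordinate on every variant.

    genotypes:  (n_samples, n_variants) allele dosages (0, 1, 2).
    embeddings: (n_samples, n_dims) encoder outputs used as phenotypes.
    Returns (beta, z, p), each of shape (n_variants, n_dims).
    Assumes no variant is monomorphic (zero dosage variance).
    """
    n = genotypes.shape[0]
    g = genotypes - genotypes.mean(axis=0)      # center dosages
    y = embeddings - embeddings.mean(axis=0)    # center phenotypes
    gss = (g ** 2).sum(axis=0)[:, None]         # per-variant sum of squares
    yss = (y ** 2).sum(axis=0)[None, :]         # per-coordinate sum of squares
    beta = (g.T @ y) / gss                      # least-squares slopes
    rss = yss - beta ** 2 * gss                 # residual sum of squares
    se = np.sqrt(rss / (n - 2) / gss)           # standard error of each slope
    z = beta / se
    # Two-sided p-value under a normal approximation to the t statistic.
    p = np.vectorize(lambda t: math.erfc(abs(t) / math.sqrt(2.0)))(z)
    return beta, z, p


# Synthetic stand-ins (hypothetical data, not from the study).
rng = np.random.default_rng(0)
n_samples = 2000
genotypes = rng.binomial(2, 0.3, size=(n_samples, 5)).astype(float)
embeddings = rng.normal(size=(n_samples, 3))
embeddings[:, 1] += 0.5 * genotypes[:, 2]       # plant one true association

beta, z, p = association_scan(genotypes, embeddings)
top_variant, top_dim = np.unravel_index(p.argmin(), p.shape)
print(f"strongest signal: variant {top_variant}, embedding coordinate {top_dim}")
```

Running one scan per embedding coordinate keeps each test a standard single-phenotype regression, which is why a low-dimensional embedding is convenient as GWAS input.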
Pages: 1604-1613
Number of pages: 27