Fast and scalable ensemble learning method for versatile polygenic risk prediction

Times cited: 0
Authors
Chen, Tony [1]
Zhang, Haoyu [2]
Mazumder, Rahul [3]
Lin, Xihong [1, 4]
Affiliations
[1] Harvard TH Chan Sch Publ Hlth, Dept Biostat, Boston, MA 02215 USA
[2] NCI, Div Canc Epidemiol & Genet, Bethesda, MD 20814 USA
[3] MIT, Sloan Sch Management, Operat Res & Stat Grp, Cambridge, MA 02139 USA
[4] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA
Keywords
polygenic risk scores; ensemble learning; L0Learn; penalized regression; LINKAGE DISEQUILIBRIUM; SELECTION; REGRESSION; ACCURACY; DISEASE; MODELS; REGULARIZATION; ASSOCIATION; INSIGHTS; COMMON
DOI
10.1073/pnas.2403210121
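Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract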
Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods are limited in computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in terms of prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, 15 times faster computation, and half the memory usage compared with current state-of-the-art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.
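The abstract names two ingredients: an L0L2 penalized regression fitted from summary statistics, and an ensemble across tuning parameters. The Python sketch below illustrates that general idea only; it is not the ALL-Sum R package or its API. The objective form, the function names (l0l2_coordinate_descent, ensemble_weights), the toy LD matrix, and the choice of least-squares stacking on a tuning sample are all assumptions introduced for illustration.

```python
import numpy as np


def l0l2_coordinate_descent(R, r, lam0, lam2, n_iter=100, tol=1e-8):
    """Cyclic coordinate descent for the (assumed) summary-level objective
        0.5 * b'Rb - b'r + lam0 * ||b||_0 + lam2 * ||b||_2^2
    R: (p, p) LD (SNP correlation) matrix; r: (p,) marginal GWAS effects.
    """
    R = np.asarray(R, dtype=float)
    r = np.asarray(r, dtype=float)
    beta = np.zeros(r.shape[0])
    denom = np.diag(R) + 2.0 * lam2  # curvature of each one-dimensional problem
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(beta.shape[0]):
            # Marginal effect of SNP j after removing the other SNPs' contribution.
            rho = r[j] - R[j] @ beta + R[j, j] * beta[j]
            # A nonzero coefficient lowers the smooth part by rho^2 / (2 * denom);
            # keep it only if that gain exceeds the L0 cost lam0.
            gain = rho ** 2 / (2.0 * denom[j])
            new_bj = rho / denom[j] if gain > lam0 else 0.0
            max_change = max(max_change, abs(new_bj - beta[j]))
            beta[j] = new_bj
        if max_change < tol:
            break
    return beta


def ensemble_weights(X_tune, y_tune, betas):
    """Stack candidate scores by least squares on a tuning sample (assumed scheme).
    Returns [intercept, w_1, ..., w_K] for the K candidate coefficient vectors.
    """
    scores = X_tune @ np.column_stack(betas)  # (n, K) candidate PRS
    design = np.column_stack([np.ones(len(y_tune)), scores])
    weights, *_ = np.linalg.lstsq(design, y_tune, rcond=None)
    return weights


# Hypothetical toy inputs: a 3-SNP LD matrix and marginal effects.
R_demo = np.array([[1.0, 0.3, 0.0],
                   [0.3, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
r_demo = np.array([0.5, 0.2, -0.4])

# One candidate coefficient vector per (lam0, lam2) pair on a small grid.
grid = [(lam0, lam2) for lam0 in (0.001, 0.01) for lam2 in (0.0, 0.5)]
candidates = [l0l2_coordinate_descent(R_demo, r_demo, a, b) for a, b in grid]
```

Under these assumptions, the L0 term acts as a hard threshold on each coordinate while the L2 term shrinks the retained effects, and the stacking step lets a tuning sample decide how much weight each grid point receives instead of selecting a single best pair.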
Pages: 9