Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations

Cited by: 0
Authors
Michael Elgart
Genevieve Lyons
Santiago Romero-Brufau
Nuzulul Kurniansyah
Jennifer A. Brody
Xiuqing Guo
Henry J. Lin
Laura Raffield
Yan Gao
Han Chen
Paul de Vries
Donald M. Lloyd-Jones
Leslie A. Lange
Gina M. Peloso
Myriam Fornage
Jerome I. Rotter
Stephen S. Rich
Alanna C. Morrison
Bruce M. Psaty
Daniel Levy
Susan Redline
Tamar Sofer
Affiliations
[1] Brigham and Women’s Hospital, Division of Sleep and Circadian Disorders
[2] Harvard Medical School, Department of Medicine
[3] Harvard T.H. Chan School of Public Health, Department of Biostatistics
[4] Mayo Clinic, Department of Medicine
[5] University of Washington, Cardiovascular Health Research Unit, Department of Medicine
[6] The Institute for Translational Genomics and Population Sciences, Department of Genetics
[7] Department of Pediatrics, The Jackson Heart Study
[8] The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Human Genetics Center, Department of Epidemiology
[9] University of North Carolina, Center for Precision Health, School of Biomedical Informatics
[10] University of Mississippi Medical Center, Department of Preventive Medicine
[11] Human Genetics, Department of Medicine
[12] and Environmental Sciences, Department of Biostatistics
[13] School of Public Health, Center for Public Health Genomics
[14] The University of Texas Health Science Center at Houston, Cardiovascular Health Research Unit, Departments of Medicine
[15] The University of Texas Health Science Center at Houston, The Population Sciences Branch of the National Heart
[16] Northwestern University
[17] University of Colorado Denver
[18] Anschutz Medical Campus
[19] Boston University School of Public Health
[20] Brown Foundation Institute of Molecular Medicine
[21] McGovern Medical School
[22] University of Texas Health Science Center at Houston
[23] University of Virginia School of Medicine
[24] Epidemiology
[25] and Health Services
[26] University of Washington
[27] Lung and Blood Institute
[28] The Framingham Heart Study
Source
Communications Biology, Volume 5
Abstract
Polygenic risk scores (PRS) are commonly used to quantify inherited susceptibility to a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this with a machine learning approach, validated on nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard linear PRS model developed using PRSice, LDpred2, and lassosum2. Including a PRS as a feature in an XGBoost model increases the percentage of variance explained, relative to the standard linear PRS model, by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Multi-ancestry trained models perform similarly to models trained on specific racial/ethnic groups and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
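The relative increases quoted in the abstract can be read as ratios of two percentage-variance-explained (PVE) values. The sketch below illustrates that arithmetic only. It assumes PVE is computed as 100 × (1 − SS_res / SS_tot), which the abstract does not state, and the function names `pve` and `relative_increase` and all numeric inputs are hypothetical, not taken from the paper.

```python
def pve(y_true, y_pred):
    """Percentage of variance explained, assumed here to be
    100 * (1 - residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 100.0 * (1.0 - ss_res / ss_tot)


def relative_increase(pve_baseline, pve_model):
    """Relative increase (in percent) of a model's PVE over a baseline PVE."""
    return 100.0 * (pve_model - pve_baseline) / pve_baseline


# Hypothetical illustration: a baseline PVE of 2.0 and a model PVE of 4.0
# correspond to a 100% relative increase, i.e. the PVE doubles.
print(relative_increase(2.0, 4.0))  # 100.0
```

Under this reading, a 100% relative increase means the model's PVE is twice that of the baseline, not that the model explains all of the trait variance.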