Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants

Cited by: 346
Authors
Alaa, Ahmed M. [1]
Bolton, Thomas [2, 3]
Di Angelantonio, Emanuele [2, 3]
Rudd, James H. F. [4, 5]
van der Schaar, Mihaela [1, 6, 7]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[2] Univ Cambridge, Dept Publ Hlth & Primary Care, Cambridge, England
[3] Univ Cambridge, NIHR, BTRU Donor Hlth & Genom, Cambridge, England
[4] Univ Cambridge, Dept Cardiovasc Med, Cambridge, England
[5] Cambridge Univ Hosp NHS Fdn Trust, Cambridge, England
[6] Univ Oxford, Oxford, England
[7] Alan Turing Inst, London, England
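Source
PLOS ONE | 2019 / Vol. 14 / Issue 05
Funding
National Science Foundation (USA); Wellcome Trust (UK); Engineering and Physical Sciences Research Council (UK);
Keywords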
ASSOCIATION TASK-FORCE; HEART-ASSOCIATION; AMERICAN-COLLEGE; 10-YEAR RISK; FOLLOW-UP; PREVENTION; FRAMINGHAM; CARDIOLOGY; GUIDELINE; EQUATIONS;
DOI
10.1371/journal.pone.0213653
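Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07; 0710; 09;
Abstract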
Background: Identifying people at risk of cardiovascular diseases (CVD) is a cornerstone of preventative cardiology. Risk prediction models currently recommended by clinical guidelines are typically based on a limited number of predictors with sub-optimal performance across all patient groups. Data-driven techniques based on machine learning (ML) might improve the performance of risk predictions by agnostically discovering novel risk predictors and learning the complex interactions between them. We tested (1) whether ML techniques based on a state-of-the-art automated ML framework (AutoPrognosis) could improve CVD risk prediction compared to traditional approaches, and (2) whether considering non-traditional variables could increase the accuracy of CVD risk predictions.
Methods and findings: Using data on 423,604 participants without CVD at baseline in UK Biobank, we developed an ML-based model for predicting CVD risk based on 473 available variables. Our ML-based model was derived using AutoPrognosis, an algorithmic tool that automatically selects and tunes ensembles of ML modeling pipelines (comprising data imputation, feature processing, classification and calibration algorithms). We compared our model with a well-established risk prediction algorithm based on conventional CVD risk factors (Framingham score), a Cox proportional hazards (PH) model based on familiar risk factors (i.e., age, gender, smoking status, systolic blood pressure, history of diabetes, receipt of treatment for hypertension and body mass index), and a Cox PH model based on all of the 473 available variables. Predictive performance was assessed using the area under the receiver operating characteristic curve (AUC-ROC). Overall, our AutoPrognosis model improved risk prediction (AUC-ROC: 0.774, 95% CI: 0.768-0.780) compared to the Framingham score (AUC-ROC: 0.724, 95% CI: 0.720-0.728, p < 0.001), the Cox PH model with conventional risk factors (AUC-ROC: 0.734, 95% CI: 0.729-0.739, p < 0.001), and the Cox PH model with all UK Biobank variables (AUC-ROC: 0.758, 95% CI: 0.753-0.763, p < 0.001). Out of 4,801 CVD cases recorded within 5 years of baseline, AutoPrognosis correctly predicted 368 more cases than the Framingham score. Our AutoPrognosis model included predictors that are not usually considered in existing risk prediction models, such as the individuals' usual walking pace and their self-reported overall health rating. Furthermore, our model improved risk prediction in potentially relevant sub-populations, such as individuals with a history of diabetes. We also highlight the relative benefits accrued from including more information in a predictive model (information gain) as compared to the benefits of using more complex models (modeling gain).
Conclusions: Our AutoPrognosis model improves the accuracy of CVD risk prediction in the UK Biobank population. This approach performs well in traditionally poorly served patient subgroups. Additionally, AutoPrognosis uncovered novel predictors for CVD that may now be tested in prospective studies. We found that the "information gain" achieved by considering more risk factors in the predictive model was significantly higher than the "modeling gain" achieved by adopting complex predictive models.
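The distinction the abstract draws between "information gain" (more predictors, same model class) and "modeling gain" (same predictors, more flexible model) can be illustrated with a minimal sketch. This is not AutoPrognosis and does not use UK Biobank data: it assumes scikit-learn, a simulated cohort, and hand-picked pipelines (imputation, feature processing, classification, calibration). All function names, coefficients and sample sizes below are hypothetical choices made for illustration only.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_synthetic_cohort(n=4000, n_features=20, seed=0):
    """Simulated data (assumption): risk depends on six of twenty variables."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, n_features))
    # Four additive effects plus one interaction that a linear model cannot capture.
    logit = (-2.0 + 0.8 * X[:, 0] + 0.6 * X[:, 1] + 0.5 * X[:, 2]
             + 0.5 * X[:, 3] + 0.5 * X[:, 4] * X[:, 5])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    # Blank out about 5% of entries so the imputation step has work to do.
    X[rng.random(X.shape) < 0.05] = np.nan
    return X, y


def simple_model():
    """Imputation -> scaling -> logistic regression."""
    return make_pipeline(SimpleImputer(strategy="median"),
                         StandardScaler(),
                         LogisticRegression(max_iter=1000))


def complex_model():
    """Imputation -> gradient boosting, wrapped in sigmoid calibration."""
    pipeline = make_pipeline(SimpleImputer(strategy="median"),
                             GradientBoostingClassifier(random_state=0))
    return CalibratedClassifierCV(pipeline, method="sigmoid", cv=3)


def held_out_auc(model, X_train, y_train, X_test, y_test):
    """Fit on the training split and return AUC-ROC on the held-out split."""
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


def compare_models(seed=0):
    X, y = make_synthetic_cohort(seed=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    return {
        # Baseline: one predictor, simple model.
        "few_predictors_simple": held_out_auc(
            simple_model(), X_tr[:, :1], y_tr, X_te[:, :1], y_te),
        # Information gain: all predictors, same simple model.
        "all_predictors_simple": held_out_auc(
            simple_model(), X_tr, y_tr, X_te, y_te),
        # Modeling gain: all predictors, more flexible model.
        "all_predictors_complex": held_out_auc(
            complex_model(), X_tr, y_tr, X_te, y_te),
    }
```

Calling `compare_models()` returns a dictionary of three held-out AUC-ROC values. The difference between the first two entries corresponds to information gain and the difference between the last two to modeling gain; the sizes of these gaps on simulated data say nothing about the values reported in the paper.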
Pages: 17
Related papers
44 in total
  • [1] Challenges of linking to routine healthcare records in UK Biobank
    Adamska, Ligia
    Allen, Naomi
    Flaig, Robin
    Sudlow, Cathie
    Lay, Michael
    Landray, Martin
    [J]. TRIALS, 2015, 16
  • [2] Machine Learning Methods Improve Prognostication, Identify Clinically Distinct Phenotypes, and Detect Heterogeneity in Response to Therapy in a Large Cohort of Heart Failure Patients
    Ahmad, Tariq
    Lund, Lars H.
    Rao, Pooja
    Ghosh, Rohit
    Warier, Prashant
    Vaccaro, Benjamin
    Dahlstrom, Ulf
    O'Connor, Christopher M.
    Felker, G. Michael
    Desai, Nihar R.
    [J]. JOURNAL OF THE AMERICAN HEART ASSOCIATION, 2018, 7 (08):
  • [3] Alaa AM, 2018, INT C MACH LEARN
  • [4] A high ankle-brachial index is associated with increased cardiovascular disease morbidity and lower quality of life
    Allison, Matthew A.
    Hiatt, William R.
    Hirsch, Alan T.
    Coll, Joseph R.
    Criqui, Michael H.
    [J]. JOURNAL OF THE AMERICAN COLLEGE OF CARDIOLOGY, 2008, 51 (13) : 1292 - 1298
  • [5] Cardiovascular Event Prediction by Machine Learning The Multi-Ethnic Study of Atherosclerosis
    Ambale-Venkatesh, Bharath
    Yang, Xiaoying
    Wu, Colin O.
    Liu, Kiang
    Hundley, W. Gregory
    McClelland, Robyn
    Gomes, Antoinette S.
    Folsom, Aaron R.
    Shea, Steven
    Guallar, Eliseo
    Bluemke, David A.
    Lima, Joao A. C.
    [J]. CIRCULATION RESEARCH, 2017, 121 (09) : 1092 - +
  • [6] Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Munster (PROCAM) study
    Assmann, G
    Cullen, P
    Schulte, H
    [J]. CIRCULATION, 2002, 105 (03) : 310 - 315
  • [7] Random forests
    Breiman, L
    [J]. MACHINE LEARNING, 2001, 45 (01) : 5 - 32
  • [8] Buse JB, 2007, DIABETES CARE, V30, P162, DOI [10.2337/dc07-9917, 10.1161/CIRCULATIONAHA.106.179294]
  • [9] XGBoost: A Scalable Tree Boosting System
    Chen, Tianqi
    Guestrin, Carlos
    [J]. KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2016 : 785 - 794
  • [10] Framingham, SCORE, and DECODE risk equations do not provide reliable cardiovascular risk estimates in type 2 diabetes
    Coleman, Ruth L.
    Stevens, Richard J.
    Retnakaran, Ravi
    Holman, Rury R.
    [J]. DIABETES CARE, 2007, 30 (05) : 1292 - 1294