A data-driven approach to predicting diabetes and cardiovascular disease with machine learning

Cited by: 209
Authors
Dinh, An [1]
Miertschin, Stacey [2]
Young, Amber [3]
Mohanty, Somya D. [4]
Affiliations
[1] Eastern Oregon Univ, Dept Math & Comp Sci, La Grande, OR USA
[2] Winona State Univ, Dept Math & Stat, Winona, MN 55987 USA
[3] Purdue Univ, Dept Stat, W Lafayette, IN 47907 USA
[4] Univ N Carolina, Dept Comp Sci, Greensboro, NC 27412 USA
Keywords
Machine learning; Health analytics; Ensemble learning; Feature learning; ISCHEMIC-HEART-DISEASE; RISK-FACTOR; REGRESSION; CHOLESTEROL; DIAGNOSIS; GLUCOSE
DOI
10.1186/s12911-019-0918-5
Chinese Library Classification (CLC) number
R-058
Abstract
Background: Diabetes and cardiovascular disease are two of the main causes of death in the United States. Identifying and predicting these diseases in patients is the first step towards stopping their progression. We evaluate the capabilities of machine learning models in detecting at-risk patients using survey data (and laboratory results), and identify key variables within the data contributing to these diseases among the patients.

Methods: Our research explores data-driven approaches that use supervised machine learning models to identify patients with such diseases. Using the National Health and Nutrition Examination Survey (NHANES) dataset, we conduct an exhaustive search of all available feature variables within the data to develop models for cardiovascular disease, prediabetes, and diabetes detection. Using different time frames and feature sets for the data (distinguished by whether laboratory data are included), multiple machine learning models (logistic regression, support vector machines, random forest, and gradient boosting) were evaluated on their classification performance. The models were then combined into a weighted ensemble model that leverages the performance of the disparate models to improve detection accuracy. Information gain of tree-based models was used to identify the key variables within the patient data that contributed to the detection of at-risk patients in each of the disease classes by the data-learned models.

Results: The developed ensemble model for cardiovascular disease (based on 131 variables) achieved an Area Under the Receiver Operating Characteristic curve (AU-ROC) score of 83.1% using no laboratory results, and 83.9% accuracy with laboratory results. In diabetes classification (based on 123 variables), the eXtreme Gradient Boosting (XGBoost) model achieved an AU-ROC score of 86.2% (without laboratory data) and 95.7% (with laboratory data). For prediabetic patients, the ensemble model had the top AU-ROC score of 73.7% (without laboratory data), and for laboratory-based data XGBoost performed the best at 84.4%. The top five predictors in diabetes patients were 1) waist size, 2) age, 3) self-reported weight, 4) leg length, and 5) sodium intake. For cardiovascular diseases the models identified 1) age, 2) systolic blood pressure, 3) self-reported weight, 4) occurrence of chest pain, and 5) diastolic blood pressure as key contributors.

Conclusion: We conclude that machine-learned models based on survey questionnaires can provide an automated identification mechanism for patients at risk of diabetes and cardiovascular diseases. We also identify key contributors to the prediction, which can be further explored for their implications on electronic health records.
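
The weighted-ensemble idea in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (not NHANES), scikit-learn's GradientBoostingClassifier stands in for XGBoost, the weighting rule (weights proportional to each model's validation AU-ROC) is a hypothetical choice, and the impurity-based feature importances only approximate the information-gain ranking the abstract describes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for survey features (assumption: NOT NHANES data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           random_state=0)

# Hold out a test set, then carve a validation set out of the remainder.
X_fit, X_test, y_fit, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_fit, y_fit, test_size=0.25, stratify=y_fit, random_state=0)

# The four model families named in the abstract.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(
        StandardScaler(), SVC(probability=True, random_state=0)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

val_auc, test_prob = {}, {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AU-ROC is computed from predicted probabilities of the positive class.
    val_auc[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    test_prob[name] = model.predict_proba(X_test)[:, 1]

# Hypothetical weighting rule: weight each model by its validation AU-ROC.
total = sum(val_auc.values())
weights = {name: auc / total for name, auc in val_auc.items()}

# Weighted average of the per-model probabilities.
ensemble_prob = sum(weights[name] * test_prob[name] for name in models)
ensemble_auc = roc_auc_score(y_test, ensemble_prob)

# Rank features by the tree model's impurity-based importance.
importances = models["gradient_boosting"].feature_importances_
top5 = np.argsort(importances)[::-1][:5]

print("validation AU-ROC per model:", val_auc)
print("ensemble test AU-ROC:", ensemble_auc)
print("top-5 feature indices:", top5)
```

Weighting by validation performance keeps the test set untouched until the final score; other schemes (for example, weights tuned by grid search) would fit the same structure.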
Pages: 15
Published in: BMC Medical Informatics and Decision Making, 19