Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework

Cited by: 15
Authors
Xue, Mingyue [1, 2]
Su, Yinxia [2 ]
Li, Chen [3 ]
Wang, Shuxia [4 ]
Yao, Hua [4 ]
Affiliations
[1] Xinjiang Med Univ, Hosp Tradit Chinese Med, Clin Med Coll 4, Urumqi, Peoples R China
[2] Xinjiang Med Univ, Coll Publ Hlth, Urumqi, Peoples R China
[3] Xinjiang Med Univ, Affiliated Hosp 1, Urumqi, Peoples R China
[4] Xinjiang Med Univ, Affiliated Hosp 1, Ctr Hlth Management, Urumqi, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
LIFE-STYLE INTERVENTIONS; RISK STRATIFICATION; LOGISTIC-REGRESSION; ALCOHOL-CONSUMPTION; PREVENTION PROGRAM; FEATURE-SELECTION; DECISION-TREE; FOLLOW-UP; MELLITUS; CLASSIFICATION;
DOI
10.1155/2020/6873891
Chinese Library Classification number
R5 [Internal Medicine]
Discipline classification code
1002; 100201
Abstract
Background. An estimated 425 million people worldwide have diabetes, which accounts for 12% of global health expenditure. The number continues to grow, placing a heavy burden on healthcare systems, especially in remote, underserved areas. Methods. A total of 584,168 adult subjects who participated in the national physical examination were enrolled in this study. Risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratios from logistic regression (LR) on physical measurement and questionnaire variables. Using the risk factors selected by LR, we applied a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output variable importance scores for T2DM. Results. XGBoost had the best performance (accuracy = 0.906, precision = 0.910, recall = 0.902, F1 = 0.906, and AUC = 0.968). Its variable importance scores showed that BMI was the most important feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). Conclusions. We proposed an LR-XGBoost classifier that uses fourteen easily obtained, noninvasive patient variables as predictors to identify potential cases of T2DM. The classifier can accurately screen for diabetes risk at an early phase, and the variable importance scores offer clues for preventing diabetes.
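The classification metrics reported in the abstract are related by standard formulas, which can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names and the confusion-matrix counts in the usage lines are hypothetical, and only the precision and recall values (0.910 and 0.902) come from the abstract.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Consistency check using the precision and recall reported for XGBoost.
reported_f1 = round(f1_from_precision_recall(0.910, 0.902), 3)

# Hypothetical counts, chosen only to exercise the function.
example = binary_metrics(tp=90, fp=10, fn=10, tn=90)
```

With the reported precision and recall, the harmonic mean rounds to 0.906, which agrees with the F1 value stated in the abstract.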
Pages: 12