Diabetes Prediction Using Ensembling of Different Machine Learning Classifiers

Citations: 143
Authors
Hasan, Md. Kamrul [1]
Alam, Md. Ashraful [1]
Das, Dola [2]
Hossain, Eklas [3]
Hasan, Mahmudul [2]
Affiliations
[1] Khulna Univ Engn & Technol, Dept Elect & Elect Engn, Khulna 9203, Bangladesh
[2] Khulna Univ Engn & Technol, Dept Comp Sci & Engn, Khulna 9203, Bangladesh
[3] Oregon Inst Technol, Dept Elect Engn & Renewable Energy, Oregon Renewable Energy Ctr OREC, Klamath Falls, OR 97601 USA
Source
IEEE Access | 2020 / Vol. 8
Keywords
Diabetes prediction; ensembling classifier; machine learning; multilayer perceptron; missing values and outliers; Pima Indian Diabetic dataset; cross-validation; neural networks; mellitus; classification; risk
DOI
10.1109/ACCESS.2020.2989857
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Diabetes is a chronic illness: a group of metabolic diseases characterized by a high level of sugar in the blood over a long period. The risk and severity of diabetes can be reduced significantly if precise early prediction is possible. Robust and accurate prediction of diabetes is highly challenging because of the limited amount of labeled data and the presence of outliers and missing values in diabetes datasets. In this paper, we propose a robust framework for diabetes prediction that employs outlier rejection, filling of missing values, data standardization, feature selection, K-fold cross-validation, several Machine Learning (ML) classifiers (k-Nearest Neighbour, Decision Trees, Random Forest, AdaBoost, Naive Bayes, and XGBoost), and a Multilayer Perceptron (MLP). We also propose a weighted ensembling of the different ML models to improve the prediction of diabetes, where each model's weight is estimated from its Area Under the ROC Curve (AUC). AUC is chosen as the performance metric and is maximized during hyperparameter tuning using grid search. All experiments were conducted under the same experimental conditions using the Pima Indian Diabetes Dataset. Across all experiments, the proposed ensembling classifier performs best, with sensitivity, specificity, false omission rate, diagnostic odds ratio, and AUC of 0.789, 0.934, 0.092, 66.234, and 0.950, respectively, outperforming the state-of-the-art results by 2.00% in AUC. The proposed framework outperforms the other methods discussed in the article on the same dataset, which can lead to better performance in diabetes prediction. Our source code for diabetes prediction is made publicly available.
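The AUC-weighted ensembling and the reported diagnostic metrics can be illustrated with a minimal sketch. This is not the authors' released code: the function names are hypothetical, the weights are assumed to be each model's validation AUC normalized to sum to 1 (the abstract only states that weights are estimated from AUC), and the ensemble is assumed to average predicted probabilities. The sketch uses only the Python standard library.

```python
from typing import Dict, List, Sequence


def auc_score(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_weighted_ensemble(
    val_true: Sequence[int],
    val_probs: Dict[str, Sequence[float]],
    test_probs: Dict[str, Sequence[float]],
) -> List[float]:
    """Average per-model probabilities, weighting each model by its validation AUC.

    Assumption: weights are AUC values normalized to sum to 1.
    """
    aucs = {name: auc_score(val_true, p) for name, p in val_probs.items()}
    total = sum(aucs.values())
    weights = {name: a / total for name, a in aucs.items()}
    n_samples = len(next(iter(test_probs.values())))
    return [
        sum(weights[name] * test_probs[name][i] for name in weights)
        for i in range(n_samples)
    ]


def diagnostic_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, false omission rate, and diagnostic odds ratio."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_omission_rate": fn / (fn + tn),
        # The odds ratio is undefined when fp * fn == 0; report infinity in that case.
        "diagnostic_odds_ratio": (tp * tn) / (fp * fn) if fp * fn else float("inf"),
    }


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the Pima Indian Diabetes Dataset.
    labels = [1, 1, 0, 0]
    probs = {"model_a": [0.9, 0.8, 0.2, 0.1], "model_b": [0.6, 0.4, 0.5, 0.3]}
    ensemble = auc_weighted_ensemble(labels, probs, probs)
    predictions = [1 if p >= 0.5 else 0 for p in ensemble]
    print(ensemble)
    print(diagnostic_metrics(labels, predictions))
```

In practice the weights would be computed on held-out validation folds and applied to unseen test data; the toy call above reuses the same probabilities only to keep the example short.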
Pages: 76516-76531
Number of pages: 16