Predicting the onset of type 2 diabetes using wide and deep learning with electronic health records

Cited by: 79
Authors
Nguyen, Binh P. [1]
Pham, Hung N. [2]
Tran, Hop [1]
Nghiem, Nhung [3]
Nguyen, Quang H. [2]
Do, Trang T. T. [4]
Cao Truong Tran [5]
Simpson, Colin R. [6, 7]
Affiliations
[1] Victoria Univ Wellington, Sch Math & Stat, Wellington 6140, New Zealand
[2] Hanoi Univ Sci & Technol, Sch Informat & Commun Technol, 1 Dai Co Viet Rd, Hanoi 100000, Vietnam
[3] Univ Otago, Dept Publ Hlth, 23A Mein St, Wellington 6021, New Zealand
[4] Agcy Sci Technol & Res, Inst Infocomm Res, 1 Fusionopolis Way, Singapore 138632, Singapore
[5] Le Quy Don Tech Univ, Fac Informat Technol, 236 Hoang Quoc Viet St, Hanoi 100000, Vietnam
[6] Victoria Univ Wellington, Fac Hlth, Wellington 6140, New Zealand
[7] Univ Edinburgh, Usher Inst, Edinburgh EH8 9AG, Midlothian, Scotland
Keywords
Electronic health records; Incidence; Onset; Prediction; Type 2 diabetes mellitus; Wide and deep learning; POPULATION; MODELS
DOI
10.1016/j.cmpb.2019.105055
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Objective: Diabetes is responsible for considerable morbidity, healthcare utilisation and mortality in both developed and developing countries. Current methods of treating diabetes are inadequate and costly, so prevention is an important step in reducing the burden of diabetes and its complications. Electronic health records (EHRs) for individuals and populations have become important tools for understanding how diseases develop. Using EHRs to predict the onset of diabetes could improve the quality and efficiency of medical care. In this paper, we apply a wide and deep learning model that combines the strength of a generalised linear model with various features and a deep feed-forward neural network to improve the prediction of the onset of type 2 diabetes mellitus (T2DM).

Materials and methods: The proposed method was implemented by training various models against a logistic loss function using stochastic gradient descent. We applied this model to public hospital record data provided by the Practice Fusion EHRs for the United States population. The dataset consists of de-identified electronic health records for 9948 patients, of whom 1904 have been diagnosed with T2DM. Prediction of diabetes in 2012 was based on data obtained from previous years (2009-2011). Class imbalance was handled by applying the Synthetic Minority Oversampling Technique (SMOTE) to each cross-validation training fold, in order to analyse performance when synthetic examples of the minority class are created. We used SMOTE of 150 and 300 percent, where 300 percent means that three new synthetic instances are created for each minority class instance. This results in approximate diabetes:non-diabetes distributions in the training set of 1:2 and 1:1, respectively.

Results: Our final ensemble model without SMOTE obtained an accuracy of 84.28%, an area under the receiver operating characteristic curve (AUC) of 84.13%, a sensitivity of 31.17% and a specificity of 96.85%. Using SMOTE of 150 and 300 percent did not improve AUC (83.33% and 82.12%, respectively) but increased sensitivity (49.40% and 71.57%, respectively) with a moderate decrease in specificity (90.16% and 76.59%, respectively).

Discussion and conclusions: Our algorithm has further optimised the prediction of diabetes onset using a novel state-of-the-art machine learning algorithm: the wide and deep learning neural network architecture. © 2019 Elsevier B.V. All rights reserved.
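The class ratios and the no-SMOTE accuracy quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code. It makes two assumptions: that each training fold keeps roughly the same class proportions as the full dataset (9948 patients, 1904 with T2DM), and that the reported sensitivity and specificity are pooled over all patients. The function names are hypothetical.

```python
# Counts reported in the abstract.
N_TOTAL = 9948
N_T2DM = 1904
N_NON_T2DM = N_TOTAL - N_T2DM  # 8044 patients without T2DM


def smote_ratio(n_minority: int, n_majority: int, smote_percent: int) -> float:
    """Return the majority-per-minority ratio after SMOTE oversampling.

    A SMOTE level of p percent adds p/100 synthetic instances for each
    original minority instance (so 300 percent adds three per instance).
    Assumption: the counts passed in reflect the training fold proportions.
    """
    n_minority_after = n_minority * (1 + smote_percent / 100)
    return n_majority / n_minority_after


def accuracy_from_rates(sensitivity: float, specificity: float,
                        n_pos: int, n_neg: int) -> float:
    """Return the overall accuracy implied by sensitivity and specificity.

    Assumption: both rates are pooled over the same n_pos and n_neg cases.
    """
    return (sensitivity * n_pos + specificity * n_neg) / (n_pos + n_neg)


if __name__ == "__main__":
    for pct in (0, 150, 300):
        ratio = smote_ratio(N_T2DM, N_NON_T2DM, pct)
        print(f"SMOTE {pct}%: diabetes:non-diabetes is about 1:{ratio:.2f}")

    acc = accuracy_from_rates(0.3117, 0.9685, N_T2DM, N_NON_T2DM)
    print(f"Accuracy implied by the no-SMOTE rates: {acc:.4f}")
```

Under these assumptions the ratios come out near 1:1.7 and 1:1.06, which is consistent with the rounded 1:2 and 1:1 in the abstract, and the implied accuracy agrees with the reported 84.28%.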
Pages: 9