Efficient treatment of outliers and class imbalance for diabetes prediction

Cited by: 67
Authors
Nnamoko, Nonso [1]
Korkontzelos, Ioannis [1]
Affiliations
[1] Edge Hill Univ, Dept Comp Sci, Ormskirk, England
Keywords
Outlier detection; Imbalanced data; Machine learning; Data preprocessing; Oversampling; SMOTE; LIFE-STYLE INTERVENTION; CLASSIFICATION; DIAGNOSIS; ARTMAP; IDENTIFICATION; PREVENTION; DISEASE; SYSTEM;
DOI
10.1016/j.artmed.2020.101815
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Learning from outliers and imbalanced data remains one of the major difficulties for machine learning classifiers. Among the numerous techniques dedicated to tackling this problem, data preprocessing solutions are known to be efficient and easy to implement. In this paper, we propose a selective data preprocessing approach that embeds knowledge of the outlier instances into an artificially generated subset to achieve an even distribution. The Synthetic Minority Oversampling TEchnique (SMOTE) was used to balance the training data by introducing artificial minority instances. Before SMOTE was applied, however, the outliers were identified and oversampled (irrespective of class). The aim is to balance the training dataset while controlling the effect of outliers. The experiments show that such selective oversampling empowers SMOTE, ultimately leading to improved classification performance.
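The two-stage idea in the abstract (oversample outliers first, then balance the classes with SMOTE-style synthetic instances) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how outliers are detected or oversampled, so the interquartile-range rule, the plain duplication of outliers, and every function and parameter name here (iqr_outlier_indices, smote_like, selective_oversample, outlier_copies, k_neighbors) are hypothetical. The interpolation step is a simplified stand-in for SMOTE written with the Python standard library only.

```python
import random
from statistics import quantiles


def iqr_outlier_indices(X, k=1.5):
    """Row indices with any feature outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the IQR rule is used only for illustration.
    """
    flagged = set()
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        q1, _, q3 = quantiles(col, n=4)
        lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
        flagged.update(i for i, v in enumerate(col) if v < lo or v > hi)
    return sorted(flagged)


def smote_like(rows, n_new, k_neighbors, rng):
    """Create n_new points, each on the segment between a random row
    and one of its k nearest neighbours (simplified SMOTE)."""
    if len(rows) < 2:
        # Nothing to interpolate with: fall back to duplication.
        return [list(rows[0]) for _ in range(n_new)]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        others = [r for r in rows if r is not base]
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def selective_oversample(X, y, outlier_copies=1, k_neighbors=5, seed=0):
    """Stage 1: duplicate outliers regardless of class (an assumption).
    Stage 2: top up every smaller class to the largest class size."""
    rng = random.Random(seed)
    X_out = [list(row) for row in X]
    y_out = list(y)

    # Stage 1: oversample the outliers, irrespective of class.
    for i in iqr_outlier_indices(X):
        for _ in range(outlier_copies):
            X_out.append(list(X[i]))
            y_out.append(y[i])

    # Stage 2: SMOTE-style balancing on the enlarged training set.
    counts = {c: y_out.count(c) for c in sorted(set(y_out))}
    target = max(counts.values())
    for c, n in counts.items():
        rows = [r for r, label in zip(X_out, y_out) if label == c]
        new_rows = smote_like(rows, target - n, k_neighbors, rng)
        X_out.extend(new_rows)
        y_out.extend([c] * len(new_rows))
    return X_out, y_out
```

Calling selective_oversample(X, y) on a list of feature rows X and a list of labels y returns the enlarged lists, with the original rows first, the outlier copies next, and the synthetic instances last. As in any oversampling scheme, this would be applied to the training split only.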
Pages: 12