Efficient treatment of outliers and class imbalance for diabetes prediction

Cited by: 67
Authors
Nnamoko, Nonso [1]
Korkontzelos, Ioannis [1]
Affiliations
[1] Edge Hill Univ, Dept Comp Sci, Ormskirk, England
Keywords
Outlier detection; Imbalanced data; Machine learning; Data preprocessing; Oversampling; SMOTE; LIFE-STYLE INTERVENTION; CLASSIFICATION; DIAGNOSIS; ARTMAP; IDENTIFICATION; PREVENTION; DISEASE; SYSTEM
DOI
10.1016/j.artmed.2020.101815
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Learning from outliers and imbalanced data remains one of the major difficulties for machine learning classifiers. Among the numerous techniques dedicated to tackling this problem, data preprocessing solutions are known to be efficient and easy to implement. In this paper, we propose a selective data preprocessing approach that embeds knowledge of the outlier instances into an artificially generated subset to achieve an even distribution. The Synthetic Minority Oversampling TEchnique (SMOTE) is used to balance the training data by introducing artificial minority instances, but only after the outliers have been identified and oversampled (irrespective of class). The aim is to balance the training dataset while controlling the effect of outliers. The experiments show that such selective oversampling empowers SMOTE, ultimately leading to improved classification performance.
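The two-stage idea in the abstract (oversample outliers of every class first, then balance the classes with SMOTE-style synthesis) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how outliers are detected, so the interquartile-range rule used here is an assumption, as are the function names (`iqr_outlier_flags`, `interpolate`, `selective_oversample`), the neighbour count, and the simplified nearest-neighbour interpolation that stands in for SMOTE.

```python
import math
import random
import statistics
from collections import Counter


def iqr_outlier_flags(rows, k=1.5):
    """Flag a row if any feature lies outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    flags = [False] * len(rows)
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        q1, _, q3 = statistics.quantiles(column, n=4)
        spread = k * (q3 - q1)
        for i, value in enumerate(column):
            if value < q1 - spread or value > q3 + spread:
                flags[i] = True
    return flags


def interpolate(seeds, pool, n_new, k_neighbors, rng):
    """SMOTE-style synthesis: move a seed a random fraction towards a near neighbour."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(seeds)
        others = [p for p in pool if p is not x]
        if not others:  # a lone point has no neighbour, so duplicate it
            synthetic.append(list(x))
            continue
        nearest = sorted(others, key=lambda p: math.dist(x, p))[:k_neighbors]
        neighbour = rng.choice(nearest)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def selective_oversample(X, y, k_neighbors=5, seed=0):
    """Oversample outliers of every class, then balance all classes."""
    rng = random.Random(seed)
    originals = [list(row) for row in X]
    flags = iqr_outlier_flags(originals)
    X_out, y_out = list(originals), list(y)

    # Stage 1: one synthetic instance per outlier, irrespective of class.
    for label in sorted(set(y)):
        same = [r for r, c in zip(originals, y) if c == label]
        outliers = [r for r, c, f in zip(originals, y, flags) if c == label and f]
        if outliers:
            new = interpolate(outliers, same, len(outliers), k_neighbors, rng)
            X_out += new
            y_out += [label] * len(new)

    # Stage 2: synthesize instances for smaller classes until all sizes match.
    counts = Counter(y_out)
    target = max(counts.values())
    for label, count in counts.items():
        deficit = target - count
        if deficit > 0:
            same = [r for r, c in zip(X_out, y_out) if c == label]
            X_out += interpolate(same, same, deficit, k_neighbors, rng)
            y_out += [label] * deficit
    return X_out, y_out
```

Calling `selective_oversample(X, y)` on a list of numeric feature rows and a parallel list of labels returns an enlarged, class-balanced copy of the training data. In practice a maintained SMOTE implementation would likely replace the hand-written interpolation.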
Pages: 12