A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance

Cited by: 340
Authors
Elreedy, Dina [1]
Atiya, Amir F. [1]
Affiliations
[1] Cairo Univ, Comp Engn Dept, Giza, Egypt
Keywords
Unbalanced data; Minority class; Over-sampling; Data level; SMOTE; Data sets; Classification
DOI
10.1016/j.ins.2019.07.070
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Imbalanced classification problems arise in many applications. The challenge is that the minority class typically has very little data, yet it is often the focus of attention. One approach to handling imbalance is to generate extra data from the minority class to overcome its shortage of data. The Synthetic Minority Over-sampling TEchnique (SMOTE) is one of the dominant methods in the literature for this extra sample generation. It generates examples on the line segments connecting a point and one of its K nearest neighbors. This paper presents a theoretical and experimental analysis of the SMOTE method. We examine how faithfully it emulates the underlying density. To our knowledge, this is the first mathematical analysis of the SMOTE method. Moreover, we analyze the effect of different factors on generation accuracy, such as the dimension, the size of the training set, and the number of neighbors K considered. We also provide a qualitative analysis that examines the factors affecting its accuracy. In addition, we explore the impact of SMOTE on the classification boundary and on classification performance. © 2019 Elsevier Inc. All rights reserved.
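The generation step described in the abstract (pick a minority point, pick one of its K nearest minority neighbors, and place a new example on the segment between them) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `smote_sample`, its parameters `k`, `n_new`, and `seed`, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Hypothetical helper for illustration; assumes numeric features and
    Euclidean distance.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        # Choose a minority point at random as the starting point.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Sort the other minority points by distance to x and keep the k nearest.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Place the new point uniformly at random on the segment from x to neighbor.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbor)))
    return synthetic


# Toy minority class in two dimensions (made-up data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote_sample(minority, k=2, n_new=20)
```

Because every synthetic point is a convex combination of two existing minority points, the generated samples never leave the convex hull of the minority data, which is one reason the generated distribution can differ from the underlying density.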
Pages: 32-64
Number of pages: 33