SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

Cited by: 1198
Authors
Fernandez, Alberto [1]
Garcia, Salvador [1]
Herrera, Francisco [1]
Chawla, Nitesh V. [2, 3]
Affiliations
[1] Univ Granada, Dept Comp Sci & Artificial Intelligence, Granada, Spain
[2] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[3] Univ Notre Dame, Interdisciplinary Ctr Network Sci & Applicat, Notre Dame, IN 46556 USA
Funding
National Science Foundation (USA);
Keywords
OVER-SAMPLING APPROACH; FEATURE-SELECTION; BIG DATA; DATA-SETS; SVM CLASSIFICATION; DATA GENERATION; MINORITY CLASS; ALGORITHM; FRAMEWORK; PERFORMANCE;
DOI
10.1613/jair.1.11192
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to the simplicity of its design, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges in extending SMOTE to Big Data problems.
Pages: 863-905
Page count: 43
Related papers
237 in total
[61]   A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets [J].
Fernandez, Alberto ;
Garcia, Salvador ;
Jose del Jesus, Maria ;
Herrera, Francisco .
FUZZY SETS AND SYSTEMS, 2008, 159 (18) :2378-2398
[62]   A Pareto-based Ensemble with Feature and Instance Selection for Learning from Multi-Class Imbalanced Datasets [J].
Fernandez, Alberto ;
Jose Carmona, Cristobal ;
Jose del Jesus, Maria ;
Herrera, Francisco .
INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2017, 27 (06)
[63]   An insight into imbalanced Big Data classification: outcomes and challenges [J].
Fernandez, Alberto ;
del Rio, Sara ;
Chawla, Nitesh V. ;
Herrera, Francisco .
COMPLEX & INTELLIGENT SYSTEMS, 2017, 3 (02) :105-120
[64]   Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks [J].
Fernandez, Alberto ;
del Rio, Sara ;
Lopez, Victoria ;
Bawakid, Abdullah ;
del Jesus, Maria J. ;
Benitez, Jose M. ;
Herrera, Francisco .
WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 2014, 4 (05) :380-409
[65]   Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches [J].
Fernandez, Alberto ;
Lopez, Victoria ;
Galar, Mikel ;
Jose del Jesus, Maria ;
Herrera, Francisco .
KNOWLEDGE-BASED SYSTEMS, 2013, 42 :97-110
[66]   Genetics-Based Machine Learning for Rule Induction: State of the Art, Taxonomy, and Comparative Study [J].
Fernandez, Alberto ;
Garcia, Salvador ;
Luengo, Julian ;
Bernado-Mansilla, Ester ;
Herrera, Francisco .
IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, 2010, 14 (06) :913-941
[67]   A dynamic over-sampling procedure based on sensitivity for multi-class problems [J].
Fernandez-Navarro, Francisco ;
Hervas-Martinez, Cesar ;
Antonio Gutierrez, Pedro .
PATTERN RECOGNITION, 2011, 44 (08) :1821-1833
[68]   Frank E, 2006, LECT NOTES ARTIF INT, V3918, P97
[69]   A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches [J].
Galar, Mikel ;
Fernandez, Alberto ;
Barrenechea, Edurne ;
Bustince, Humberto ;
Herrera, Francisco .
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS, 2012, 42 (04) :463-484
[70]   Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets [J].
Galar, Mikel ;
Fernandez, Alberto ;
Barrenechea, Edurne ;
Bustince, Humberto ;
Herrera, Francisco .
INFORMATION SCIENCES, 2016, 354 :178-196