On the effectiveness of preprocessing methods when dealing with different levels of class imbalance

Cited by: 251
Authors
Garcia, V. [1 ]
Sanchez, J. S. [1 ]
Mollineda, R. A. [1 ]
Affiliations
[1] Univ Jaume 1, Dept Llenguatges & Sistemes Informat, Inst New Imaging Technol, Castellon de La Plana 12071, Spain
Keywords
Imbalance; Resampling; Classification; Performance measures; Multi-dimensional scaling; NEAREST-NEIGHBOR; CLASSIFICATION;
DOI
10.1016/j.knosys.2011.06.013
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The present paper investigates the influence of both the imbalance ratio and the classifier on the performance of several resampling strategies for dealing with imbalanced data sets. The study focuses on evaluating how learning is affected when different resampling algorithms transform the originally imbalanced data into artificially balanced class distributions. Experiments over 17 real data sets using eight different classifiers, four resampling algorithms and four performance evaluation measures show that over-sampling the minority class consistently outperforms under-sampling the majority class when data sets are strongly imbalanced, whereas there are no significant differences for databases with a low imbalance. Results also indicate that the classifier has very little influence on the effectiveness of the resampling strategies. (C) 2011 Elsevier B.V. All rights reserved.
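The two resampling directions contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the four resampling algorithms, so plain random over-sampling and random under-sampling are used here as stand-ins. The function names are hypothetical, and the imbalance ratio is assumed to be the majority class size divided by the minority class size.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: largest class size divided by smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items of smaller classes until all classes
    match the largest one (illustrative stand-in, not the paper's algorithm)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    pairs = list(zip(samples, labels))
    for cls, n in counts.items():
        pool = [(x, y) for x, y in zip(samples, labels) if y == cls]
        pairs.extend(rng.choice(pool) for _ in range(target - n))
    return [x for x, _ in pairs], [y for _, y in pairs]


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class
    (illustrative stand-in, not the paper's algorithm)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    pairs = []
    for cls in counts:
        pool = [(x, y) for x, y in zip(samples, labels) if y == cls]
        pairs.extend(rng.sample(pool, target))
    return [x for x, _ in pairs], [y for _, y in pairs]


# Toy data: 90 majority items (label 0) and 10 minority items (label 1).
samples = list(range(100))
labels = [0] * 90 + [1] * 10

over_x, over_y = random_oversample(samples, labels)
under_x, under_y = random_undersample(samples, labels)
print(imbalance_ratio(labels), len(over_y), len(under_y))
```

On this toy set, over-sampling grows the data to 180 items while under-sampling shrinks it to 20, discarding 80 majority items. That loss of information is one plausible reason under-sampling could fare worse at high imbalance, though the abstract itself does not give an explanation.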
Pages: 13-21
Page count: 9