A method for resampling imbalanced datasets in binary classification tasks for real-world problems

Cited by: 120
Authors
Cateni, Silvia [1]
Colla, Valentina [1]
Vannucci, Marco [1]
Affiliations
[1] Scuola Superiore Sant'Anna, TeCIP Institute, Pisa, Italy
Keywords
Oversampling; Undersampling; Imbalanced dataset
DOI
10.1016/j.neucom.2013.05.059
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The paper presents a novel resampling method for binary classification problems on imbalanced datasets. Imbalanced datasets are frequently found in industrial applications: for instance, the occurrence of particular product defects, the diagnosis of severe diseases in a series of patients, and machine faults are rare events whose detection is of utmost importance. The proposed method combines an oversampling technique and an undersampling technique. Several tests have been carried out to assess the efficiency of the proposed method. Four classifiers based, respectively, on Support Vector Machine, Decision Tree, labelled Self-Organizing Map and Bayesian Classifiers have been developed and applied for binary classification on four datasets: a synthetic dataset, a widely used public dataset and two datasets coming from industrial applications. The results of the tests are presented and discussed in the paper; in particular, the performance achieved by the four classifiers with the proposed resampling approach is compared with the performance obtained in three other settings: without any resampling, with a widely applied and well-known resampling technique (the classical SMOTE approach), and with another approach coupling informed SMOTE-based oversampling and informed clustering-based undersampling. © 2014 Elsevier B.V. All rights reserved.
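The abstract does not spell out the authors' procedure, so the following is only a minimal sketch of the general idea of combining undersampling with oversampling, not the method proposed in the paper. It assumes plain random undersampling of the majority class and SMOTE-style interpolation between minority samples; the function name `resample_binary` and the parameters `keep_majority` and `k` are hypothetical and chosen for illustration.

```python
import math
import random


def resample_binary(X, y, minority_label=1, keep_majority=0.5, k=3, seed=0):
    """Illustrative only: random undersampling plus SMOTE-style oversampling.

    X is a list of numeric feature lists, y the matching list of labels.
    This is an assumption-laden sketch, not the paper's algorithm.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    majority = [(x, label) for x, label in zip(X, y) if label != minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")

    # Undersampling: keep a random subset of the majority class,
    # but never fewer samples than the minority class already has.
    n_keep = max(len(minority), int(len(majority) * keep_majority))
    n_keep = min(n_keep, len(majority))
    kept = rng.sample(majority, n_keep)

    # Oversampling: create synthetic minority points on the segment between
    # a minority sample and one of its k nearest minority neighbours,
    # until both classes have the same size.
    synthetic = []
    while len(minority) + len(synthetic) < n_keep:
        base = rng.choice(minority)
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: math.dist(base, m),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])

    X_out = [x for x, _ in kept] + minority + synthetic
    y_out = [label for _, label in kept]
    y_out += [minority_label] * (len(minority) + len(synthetic))
    return X_out, y_out


# Toy data: 20 majority samples (label 0) and 4 minority samples (label 1).
X_toy = [[float(i), float(i % 5)] for i in range(20)]
X_toy += [[100.0, 100.0], [101.0, 100.0], [100.0, 101.0], [101.0, 101.0]]
y_toy = [0] * 20 + [1] * 4
X_res, y_res = resample_binary(X_toy, y_toy)
```

With the default `keep_majority=0.5`, the toy call keeps half of the majority samples and generates synthetic minority samples until the two classes are the same size. An informed variant, as mentioned in the abstract, would replace the random choices with criteria based on the data distribution.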
Pages: 32-41
Page count: 10
Related papers
36 in total
[1]  
[Anonymous], 1984, OLSHEN STONE CLASSIF, DOI 10.2307/2530946
[2]  
[Anonymous], 1994, P MACH LEARN P
[3]  
[Anonymous], 1997, P 14 INT C ON MACHINE
[4]  
[Anonymous], 2003, C4 5 IMBALANCED DATA
[5]   Strategies for learning in class imbalance problems [J].
Barandela, R ;
Sánchez, JS ;
García, V ;
Rangel, E .
PATTERN RECOGNITION, 2003, 36 (03) :849-851
[6]  
Batuwita R, 2010, IEEE IJCNN
[7]   A multivariate fuzzy system applied for outliers detection [J].
Cateni, Silvia ;
Colla, Valentina ;
Nastasi, Gianluca .
JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2013, 24 (04) :889-903
[8]   SMOTE: Synthetic minority over-sampling technique [J].
Chawla, Nitesh V. ;
Bowyer, Kevin W. ;
Hall, Lawrence O. ;
Kegelmeyer, W. Philip .
2002, American Association for Artificial Intelligence (16)
[9]   A Novel Differential Evolution-Clustering Hybrid Resampling Algorithm on Imbalanced Datasets [J].
Chen, Leichen ;
Cai, Zhihua ;
Chen, Lu ;
Gu, Qiong .
THIRD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING: WKDD 2010, PROCEEDINGS, 2010, :81-85
[10]  
Cohen W. W., 1995, Machine Learning. Proceedings of the Twelfth International Conference on Machine Learning, P115