An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics

Cited by: 1100
Authors
Lopez, Victoria [1 ]
Fernandez, Alberto [2 ]
Garcia, Salvador [2 ]
Palade, Vasile [3 ]
Herrera, Francisco [1 ]
Affiliations
[1] Univ Granada, CITIC UGR Res Ctr Informat & Commun Technol, Dept Comp Sci & Artificial Intelligence, Granada, Spain
[2] Univ Jaen, Dept Comp Sci, Jaen, Spain
[3] Univ Oxford, Dept Comp Sci, Oxford OX1 3QD, England
Keywords
Imbalanced dataset; Sampling; Cost-sensitive learning; Small disjuncts; Noisy data; Dataset shift; SUPPORT VECTOR MACHINES; IMPROVING CLASSIFICATION; FEATURE-SELECTION; SAMPLING APPROACH; COVARIATE SHIFT; NEURAL-NETWORKS; MINORITY CLASS; SOFTWARE TOOL; DECISION TREE; DATA-SETS;
DOI
10.1016/j.ins.2013.07.007
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Subject Classification Code
0812
Abstract
Training classifiers with datasets that suffer from imbalanced class distributions is an important problem in data mining. This issue occurs when the number of examples representing the class of interest is much lower than that of the other classes. Its presence in many real-world applications has attracted growing attention from researchers. We briefly review the many issues this problem raises in machine learning and its applications, introducing the characteristics of the imbalanced dataset scenario in classification, presenting the specific metrics for evaluating performance in class-imbalanced learning, and enumerating the proposed solutions. In particular, we describe preprocessing, cost-sensitive learning and ensemble techniques, and carry out an experimental study to contrast these approaches in an intra- and inter-family comparison. We then thoroughly discuss the main issues related to using data intrinsic characteristics in this classification problem. This will help to improve current models with respect to: the presence of small disjuncts, the lack of density in the training data, the overlap between classes, the identification of noisy data, the significance of borderline instances, and the dataset shift between the training and test distributions. Finally, we introduce several approaches and recommendations to address these problems in conjunction with imbalanced data, and we show some experimental examples of the behavior of learning algorithms on data with such intrinsic characteristics. (C) 2013 Elsevier Inc. All rights reserved.
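Illustrative note (not part of the original abstract and not the authors' experimental framework): the contrast drawn above between cost-sensitive learning, resampling-based preprocessing, and imbalance-aware evaluation metrics can be sketched as follows. The sketch assumes scikit-learn and a synthetic dataset; it compares a plain decision tree, a cost-sensitive tree (class_weight='balanced'), and random oversampling of the minority class, scored with the class-wise recalls, their geometric mean, and the AUC.

# Minimal sketch (assumption: scikit-learn; not the paper's experimental setup):
# contrasting a plain decision tree, a cost-sensitive tree, and random
# oversampling on a synthetic imbalanced problem, with imbalance-aware metrics.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

# Synthetic two-class data with roughly a 19:1 imbalance ratio.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)

def evaluate(name, clf, X_fit, y_fit):
    """Fit and report minority-class recall, majority-class recall, G-mean, AUC."""
    clf.fit(X_fit, y_fit)
    y_pred = clf.predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    tpr = tp / (tp + fn)        # recall on the minority (positive) class
    tnr = tn / (tn + fp)        # recall on the majority (negative) class
    gmean = np.sqrt(tpr * tnr)  # geometric mean, common in imbalanced learning
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name:>20}: TPR={tpr:.3f}  TNR={tnr:.3f}  G-mean={gmean:.3f}  AUC={auc:.3f}")

# Baseline learner that ignores the class imbalance.
evaluate("plain tree", DecisionTreeClassifier(random_state=0), X_tr, y_tr)

# Cost-sensitive learning: misclassification costs inversely proportional
# to class frequencies via class_weight='balanced'.
evaluate("cost-sensitive tree",
         DecisionTreeClassifier(class_weight="balanced", random_state=0),
         X_tr, y_tr)

# Preprocessing: random oversampling of the minority class until both
# classes have the same number of training examples.
rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
majority = np.where(y_tr == 0)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])
evaluate("oversampled tree", DecisionTreeClassifier(random_state=0),
         X_tr[idx], y_tr[idx])

On such data, the overall accuracy of all three models is typically similar, while the minority-class recall and G-mean differ noticeably, which is why the paper argues for imbalance-aware metrics rather than plain accuracy.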
Pages: 113-141
Number of pages: 29