Random Balance: Ensembles of variable priors classifiers for imbalanced data

Cited by: 168
Authors
Diez-Pastor, Jose F. [1 ]
Rodriguez, Juan J. [1 ]
Garcia-Osorio, Cesar [1 ]
Kuncheva, Ludmila I. [2 ]
Affiliations
[1] Escuela Politecn Super, Lenguajes & Sistemas Informat, Burgos 09006, Spain
[2] Bangor Univ, Sch Comp Sci, Bangor LL57 1UT, Gwynedd, Wales
Keywords
Classifier ensembles; Imbalanced data sets; Bagging; AdaBoost; SMOTE; Undersampling; Decision trees; Classification
DOI: 10.1016/j.knosys.2015.04.022
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In Machine Learning, a data set is imbalanced when the class proportions are highly skewed. Imbalanced data sets arise routinely in many application domains and pose a challenge to traditional classifiers. We propose a new approach to building ensembles of classifiers for two-class imbalanced data sets, called Random Balance. Each member of the Random Balance ensemble is trained with data sampled from the training set and augmented by artificial instances obtained using SMOTE. The novelty in the approach is that the proportions of the classes for each ensemble member are chosen randomly. The intuition behind the method is that the proposed diversity heuristic will ensure that the ensemble contains classifiers that are specialized for different operating points on the ROC space, thereby leading to larger AUC compared to other ensembles of classifiers. Experiments have been carried out to test the Random Balance approach by itself, and also in combination with standard ensemble methods. As a result, we propose a new ensemble creation method called RB-Boost which combines Random Balance with AdaBoost.M2. This combination involves enforcing random class proportions in addition to instance re-weighting. Experiments with 86 imbalanced data sets from two well known repositories demonstrate the advantage of the Random Balance approach. (C) 2015 Elsevier B.V. All rights reserved.
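The core idea described in the abstract, drawing a random class proportion for each ensemble member, undersampling the class that exceeds it, and topping up the other class with SMOTE-style synthetic instances, can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `random_balance_sample` is invented for this sketch, and the synthetic points interpolate toward a random same-class partner rather than a true k-nearest neighbour as SMOTE specifies.

```python
import numpy as np

def random_balance_sample(X, y, rng):
    """One Random Balance resample (illustrative sketch): draw a random
    class proportion, undersample the over-represented class, and grow
    the other class with interpolated synthetic instances."""
    n = len(y)
    classes = np.unique(y)
    assert len(classes) == 2, "Random Balance is defined for two-class data"
    # Random size for the first class; keep at least 2 instances per class
    n0 = int(rng.integers(2, n - 1))
    targets = {classes[0]: n0, classes[1]: n - n0}
    parts_X, parts_y = [], []
    for c, target in targets.items():
        Xc = X[y == c]
        if target <= len(Xc):
            # Undersample without replacement down to the target size
            idx = rng.choice(len(Xc), size=target, replace=False)
            parts_X.append(Xc[idx])
        else:
            # Keep all originals, then add synthetic points by interpolating
            # between random pairs of same-class instances (SMOTE-like)
            extra = target - len(Xc)
            a = Xc[rng.integers(0, len(Xc), size=extra)]
            b = Xc[rng.integers(0, len(Xc), size=extra)]
            gap = rng.random((extra, 1))
            parts_X.append(np.vstack([Xc, a + gap * (b - a)]))
        parts_y.append(np.full(target, c))
    return np.vstack(parts_X), np.concatenate(parts_y)
```

Training each base classifier on a different such resample is what gives the ensemble members different operating points on the ROC curve; the total training-set size stays fixed while only the class proportions vary.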
Pages: 96-111
Number of pages: 16