Neighbourhood sampling in bagging for imbalanced data

Cited by: 148
Authors
Blaszczynski, Jerzy [1]
Stefanowski, Jerzy [1]
Affiliations
[1] Poznan Univ Tech, Inst Comp Sci, PL-60965 Poznan, Poland
Keywords
Class imbalance; Ensemble classifiers; Bagging; Identification; Classification
DOI
10.1016/j.neucom.2014.07.064
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Various approaches to extending bagging ensembles for class-imbalanced data are considered. First, we review known extensions and compare them in a comprehensive experimental study. The results show that integrating bagging with under-sampling is more powerful than integrating it with over-sampling. They also identify Roughly Balanced Bagging as the most accurate extension. Then, we point out that a complex and difficult distribution of the minority class can be handled by analyzing the content of the neighbourhood of examples. In our study we show that taking such local characteristics of the minority class distribution into account can be useful both for analyzing the performance of ensembles with respect to data difficulty factors and for proposing new generalizations of bagging. We demonstrate this by proposing Neighbourhood Balanced Bagging, where the sampling probabilities of examples are modified according to the class distribution in their neighbourhood. Two versions are considered: the first keeps a larger size of bootstrap samples through hybrid over-sampling, and the other reduces this size with stronger under-sampling. Experiments show that the first version is significantly better than existing over-sampling bagging extensions, while the other version is competitive with Roughly Balanced Bagging. Finally, we demonstrate that detecting types of minority examples depending on their neighbourhood may help explain why some ensembles work better for imbalanced data than others. (C) 2014 Elsevier B.V. All rights reserved.
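The idea of modifying sampling probabilities according to the class distribution in an example's neighbourhood can be sketched roughly as follows. This is only an illustration under stated assumptions, not the weighting formula from the paper: the function names `neighbourhood_weights` and `draw_bootstrap`, the choice of k = 5, the minority weight `1 + (fraction of majority neighbours)`, and the majority weight `n_min / n_maj` are all hypothetical choices made for this example.

```python
import numpy as np

def neighbourhood_weights(X, y, minority_label=1, k=5):
    """Return sampling probabilities for each example.

    Hypothetical weighting (an assumption, not the paper's formula):
    a minority example gets 1 + the fraction of majority examples among
    its k nearest neighbours; a majority example gets n_min / n_maj.
    Assumes both classes are present and k < number of examples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_min = y == minority_label
    n_min, n_maj = is_min.sum(), (~is_min).sum()

    # Pairwise squared Euclidean distances (adequate for small data sets).
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # an example is not its own neighbour

    nn = np.argsort(d, axis=1)[:, :k]       # indices of the k nearest neighbours
    maj_frac = (~is_min)[nn].mean(axis=1)   # share of majority neighbours

    w = np.where(is_min, 1.0 + maj_frac, n_min / n_maj)
    return w / w.sum()

def draw_bootstrap(X, y, rng, minority_label=1, k=5):
    """Draw indices of one bootstrap sample using the neighbourhood weights."""
    p = neighbourhood_weights(X, y, minority_label=minority_label, k=k)
    return rng.choice(len(y), size=len(y), replace=True, p=p)
```

In a bagging ensemble, one such index array would be drawn per base classifier, for example `idx = draw_bootstrap(X, y, np.random.default_rng(0))`, and each classifier would be trained on `X[idx], y[idx]`. Under this illustrative weighting, minority examples surrounded by majority examples are drawn more often than minority examples in homogeneous regions.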
Pages: 529-542
Page count: 14