Cluster-based Under-sampling with Random Forest for Multi-Class Imbalanced Classification

Cited by: 0
Authors:
Arafat, Md. Yasir [1]
Hoque, Sabera [1]
Farid, Dewan Md. [1]
Affiliations:
[1] United Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh
Source:
2017 11TH INTERNATIONAL CONFERENCE ON SOFTWARE, KNOWLEDGE, INFORMATION MANAGEMENT AND APPLICATIONS (SKIMA) | 2017
Keywords:
AdaBoost; Imbalanced Data; Random Forest; RUS-Boost; SMOTEBoost; DATA-SETS
DOI:
None
Chinese Library Classification:
TP [Automation technology, computer technology]
Subject Classification Code:
0812
Abstract:
Multi-class imbalanced classification has emerged as a very challenging research area in machine learning for data mining applications. It occurs when the number of training instances in the majority classes is much higher than in the minority classes. Existing machine learning algorithms achieve good accuracy when classifying majority class instances, but ignore or misclassify minority class instances. However, the minority class instances often hold the most vital information, and misclassifying them can lead to serious problems. Several sampling techniques combined with ensemble learning have been proposed for binary-class imbalanced classification over the last decade. In this paper, we propose a new ensemble learning technique that employs cluster-based under-sampling with the random forest algorithm for classifying multi-class, highly imbalanced data. The proposed approach clusters the majority class instances and then selects the most informative majority class instances from each cluster to form several balanced datasets. The random forest algorithm is then applied to each balanced dataset, and a majority voting technique is used to classify test/new instances. We compared the performance of our proposed method with popular sampling-with-boosting methods, namely AdaBoost, RUSBoost, and SMOTEBoost, on 13 benchmark imbalanced datasets. The experimental results show that the proposed cluster-based under-sampling with random forest technique achieves high accuracy for classifying both majority and minority class instances in comparison with existing methods.
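The pipeline the abstract describes — cluster the majority class, keep the most informative instances per cluster to build balanced datasets, train a random forest on each, and combine them by majority vote — can be sketched roughly as below. This is not the authors' code: it uses scikit-learn's `KMeans` and `RandomForestClassifier`, treats "most informative" as "closest to a cluster centroid", and shows a binary case for brevity (the paper targets multi-class). All parameter values are illustrative assumptions.

```python
# Sketch of cluster-based under-sampling + random forest ensemble (assumed
# interpretation of the paper, not the authors' implementation).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

def cluster_undersample(X_maj, n_keep, n_clusters=5, seed=0):
    """Keep the n_keep majority instances closest to their cluster centroids."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_maj)
    dist = np.min(km.transform(X_maj), axis=1)  # distance to nearest centroid
    return X_maj[np.argsort(dist)[:n_keep]]     # most "central" = most informative

# Imbalanced toy data: class 0 is a 9:1 majority over class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_maj, X_min = X[y == 0], X[y == 1]

# Build several balanced datasets from different cluster samples; one forest each.
forests = []
for seed in range(3):
    X_bal = np.vstack([cluster_undersample(X_maj, len(X_min), seed=seed), X_min])
    y_bal = np.array([0] * len(X_min) + [1] * len(X_min))
    forests.append(
        RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_bal, y_bal)
    )

# Majority vote across the forests classifies new instances.
votes = np.stack([f.predict(X) for f in forests])
pred = (votes.sum(axis=0) > len(forests) / 2).astype(int)
print("minority recall:", (pred[y == 1] == 1).mean())
```

Because each forest is trained on a balanced dataset, the minority class is no longer swamped at training time, which is what drives the improved minority-class recall the paper reports.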
Pages: 6
References (25 total)
[1] Afza A. A., 2011, WORLD COMPUTER SCI I, V1, P105
[2] Alcalá-Fdez J, 2011, J MULT-VALUED LOG S, V17, P255
[3] Beyan, Cigdem; Fisher, Robert. Classifying imbalanced data sets using similarity based hierarchical decomposition. PATTERN RECOGNITION, 2015, 48(05):1653-1672
[4] Blagus, Rok; Lusa, Lara. SMOTE for high-dimensional class-imbalanced data. BMC BIOINFORMATICS, 2013, 14
[5] Breiman, L. Random forests. MACHINE LEARNING, 2001, 45(01):5-32
[6] Chawla, Nitesh V.; Bowyer, Kevin W.; Hall, Lawrence O.; Kegelmeyer, W. Philip. SMOTE: Synthetic minority over-sampling technique. 2002, American Association for Artificial Intelligence (16)
[7] Chawla, NV; Lazarevic, A; Hall, LO; Bowyer, KW. SMOTEBoost: Improving prediction of the minority class in boosting. KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2003, PROCEEDINGS, 2003, 2838:107-119
[8] Dal Pozzolo, Andrea; Caelen, Olivier; Bontempi, Gianluca. When is Undersampling Effective in Unbalanced Classification Tasks? MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2015, PT I, 2015, 9284:200-215
[9] Farid D., 2010, INT C DAT MIN KNOWL, P186
[10] Farid D.M., 2016, 25 BELG DUTCH C MACH, P1