BPSO-Adaboost-KNN ensemble learning algorithm for multi-class imbalanced data classification

Cited by: 131
Authors
Guo Haixiang [1, 2, 3, 4]
Li Yijing [1, 2]
Li Yanan [1, 2]
Liu Xiao [1, 2]
Li Jinling [1]
Affiliations
[1] China Univ Geosci, Coll Econ & Management, Wuhan 430074, Peoples R China
[2] China Univ Geosci, Res Ctr Digital Business Management, Wuhan 430074, Peoples R China
[3] China Univ Geosci, Mineral Resource Strategy & Policy Res Ctr, Wuhan 430074, Peoples R China
[4] Cent South Univ, Sch Business, Changsha 410083, Hunan, Peoples R China
Keywords
Imbalanced data; Ensemble; Feature selection; Classification; Oil reservoir; FEATURE-SELECTION; DATA-SETS
DOI
10.1016/j.engappai.2015.09.011
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper proposes an ensemble algorithm named BPSO-Adaboost-KNN for multi-class imbalanced data classification. The main idea is to integrate feature selection and boosting into a single ensemble. We also use a novel evaluation metric, AUCarea, designed specifically for multi-class classification. In our model, BPSO serves as the feature selection algorithm, with AUCarea as its fitness function. For classification, we build a boosting classifier with KNN as the base classifier. To verify the effectiveness of our method, we run experiments on 19 benchmark data sets. The results show that carrying out feature selection improves both the stability and the accuracy of boosting, and that the performance of the proposed algorithm is comparable with that of other state-of-the-art algorithms. In the statistical analyses, we apply Bland-Altman analysis to show the consistency between AUCarea and other popular metrics such as average G-mean and average F-value. We also use linear regression to examine the correlation between AUCarea and the other metrics more closely, in order to show why AUCarea works well for this problem. A further series of statistical studies analyzes whether feature selection and boosting yield significant improvements. Finally, the proposed algorithm is applied to oil-bearing reservoir recognition. The classification precision reaches 99% on the oilsk81-oilsk85 well logging data from the Jianghan oilfield in China, which is 20% higher than that of the KNN classifier. In particular, the proposed algorithm shows significant superiority in distinguishing the oil layer from other layers. (C) 2015 Elsevier Ltd. All rights reserved.
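The feature selection step described in the abstract can be illustrated with a minimal sketch, assuming that BPSO denotes binary particle swarm optimization with the usual sigmoid transfer function. This is not the authors' implementation: the abstract does not define AUCarea, so `toy_fitness` below is a hypothetical stand-in, and the function names, the parameter values (`w`, `c1`, `c2`, `v_max`), and the particle and iteration counts are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(v):
    """Transfer function mapping a velocity to the probability of setting a bit."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso_select(n_features, fitness, n_particles=20, n_iters=50,
                w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Binary PSO sketch: return (best_mask, best_fitness), maximizing `fitness`.

    `fitness` takes a list of 0/1 flags (1 = feature kept) and returns a float.
    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                # personal best masks
    pbest_fit = [fitness(p) for p in pos]      # personal best scores
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))  # clamp the velocity
                # Resample the bit with probability sigmoid(velocity).
                pos[i][d] = 1 if rng.random() < sigmoid(vel[i][d]) else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


def toy_fitness(mask):
    """Hypothetical stand-in for AUCarea: rewards features 0 and 2, penalizes others."""
    useful = {0, 2}
    hits = sum(1 for d, bit in enumerate(mask) if bit and d in useful)
    extras = sum(1 for d, bit in enumerate(mask) if bit and d not in useful)
    return hits - 0.5 * extras


mask, score = bpso_select(n_features=6, fitness=toy_fitness)
print(mask, score)
```

In a setting closer to the one the abstract describes, `fitness` would presumably train the boosted KNN classifier on the features selected by the mask and return its AUCarea on held-out data; the toy function only keeps the sketch self-contained.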
Pages: 176-193
Number of pages: 18