A design of information granule-based under-sampling method in imbalanced data classification

Cited by: 0
Authors
Tianyu Liu
Xiubin Zhu
Witold Pedrycz
Zhiwu Li
Affiliations
[1] Xidian University, School of Electro-Mechanical Engineering
[2] University of Alberta, Department of Electrical and Computer Engineering
[3] Macau University of Science and Technology, Institute of Systems Engineering
[4] King Abdulaziz University, Faculty of Engineering
[5] Guilin University of Electronic Technology
Source
Soft Computing | 2020 / Volume 24
Keywords
Imbalanced data; Information granule; Support vector machine (SVM); K-nearest-neighbor (KNN); Under-sampling
DOI: not available
Abstract
In numerous real-world problems, learning from imbalanced data is difficult: the classification performance of a “standard” classifier (learning algorithm) is evidently hindered by the skewed class distribution. Over-sampling and under-sampling methods have been researched extensively with the aim of increasing prediction accuracy on the minority class; however, traditional under-sampling methods tend to discard important characteristics of the majority class. In this paper, a novel under-sampling method based on information granules is proposed, exploiting the concepts and algorithms of granular computing. First, information granules are built around selected patterns coming from the majority class to capture the essence of the data belonging to this class. In the sequel, the resulting information granules are evaluated in terms of their quality, and those with the highest specificity values are selected. Next, the selected numeric data are augmented with weights implied by the sizes of the information granules. Finally, a support vector machine (SVM) and a K-nearest-neighbor (KNN) classifier, regarded here as representative classifiers, are built on the weighted data. Experimental studies are carried out on synthetic data as well as a suite of imbalanced data sets coming from public machine learning repositories. The results quantify the performance of the SVM and KNN classifiers combined with the information granule-based under-sampling method and demonstrate its superiority over conventional under-sampling: in terms of G-means, the improvement is over 10% when applying information granule-based under-sampling compared with random under-sampling.
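Only the high-level pipeline is stated in the abstract. As a rough illustration of that pipeline, the minimal sketch below (scikit-learn) works under stated assumptions that are not the authors' formulation: granules are hyperspheres around k-means prototypes of the majority class, specificity is approximated as the inverse of the granule radius, and weights are proportional to granule size. The names granule_undersample, fit_weighted_svm, and g_mean are hypothetical helpers introduced here for illustration.

```python
# Illustrative sketch only (not the authors' exact algorithm): hyperspherical
# granules are formed around k-means prototypes of the majority class, the
# specificity of a granule is approximated as 1 / (1 + radius), and each kept
# prototype is weighted by the number of majority samples it covers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC


def granule_undersample(X_maj, n_granules=20, keep_ratio=0.5, random_state=0):
    """Reduce the majority class to weighted prototypes of the most specific granules."""
    n_granules = min(n_granules, len(X_maj))
    km = KMeans(n_clusters=n_granules, n_init=10, random_state=random_state).fit(X_maj)
    granules = []
    for g in range(n_granules):
        members = X_maj[km.labels_ == g]
        if len(members) == 0:
            continue
        radius = np.max(np.linalg.norm(members - km.cluster_centers_[g], axis=1))
        specificity = 1.0 / (1.0 + radius)  # smaller granules are more specific
        granules.append((km.cluster_centers_[g], len(members), specificity))
    granules.sort(key=lambda t: t[2], reverse=True)  # keep the most specific granules
    kept = granules[: max(1, int(keep_ratio * len(granules)))]
    X_red = np.vstack([proto for proto, _, _ in kept])
    w_red = np.array([size for _, size, _ in kept], dtype=float)  # weight = granule size
    return X_red, w_red


def fit_weighted_svm(X_min, X_maj, **granule_kwargs):
    """Train an SVM on the minority class plus weighted majority prototypes."""
    X_red, w_red = granule_undersample(X_maj, **granule_kwargs)
    X = np.vstack([X_min, X_red])
    y = np.hstack([np.ones(len(X_min)), np.zeros(len(X_red))]).astype(int)
    w = np.hstack([np.ones(len(X_min)), w_red / w_red.mean()])
    return SVC(kernel="rbf", gamma="scale").fit(X, y, sample_weight=w)


def g_mean(y_true, y_pred):
    """Geometric mean of the recalls of the two classes."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))


if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = fit_weighted_svm(X_tr[y_tr == 1], X_tr[y_tr == 0])
    print("G-mean on the test split:", g_mean(y_te, clf.predict(X_te)))
```

The radius-based specificity and the k-means prototypes stand in for the paper's granule construction; substituting the authors' quality criterion would only change granule_undersample, while the weighted classifier training and G-means evaluation stay the same.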
Pages: 17333–17347
Page count: 14