Under-sampling method based on sample weight for imbalanced data

Cited by: 3
Authors:
Xiong B. [1]
Wang G. [1]
Deng W. [1]
Affiliation:
[1] Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing
Source:
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2016 / Vol. 53 / No. 11
Funding:
National Natural Science Foundation of China
Keywords:
Clustering; Ensemble learning; Imbalanced data; Sample weight; Under-sampling;
DOI:
10.7544/issn1000-1239.2016.20150593
Abstract:
Imbalanced data is widespread in the real world, and its classification is a hot topic in data mining and machine learning. Under-sampling is a widely used approach to imbalanced data sets; its main idea is to choose a subset of the majority class so that the data set becomes balanced. However, some useful majority-class information may be lost in the process. To address this problem, an under-sampling method based on sample weight, named KAcBag (K-means AdaCost bagging), is proposed. In this method, sample weights are introduced to reveal the region in which each sample is located. First, each sample is assigned a weight according to the class size, and the weights are adjusted after clustering the data set, so that samples near the center of the majority class receive larger weights. Then majority-class samples are drawn according to their weights; in this procedure, samples in the center of the majority class are more likely to be selected. The sampled majority-class samples and all the minority-class samples are combined as the training set for a component classifier, which yields several decision-tree sub-classifiers. Finally, the prediction model is constructed by weighting each sub-classifier by its accuracy. Experiments on nineteen UCI data sets and telecom user data show that KAcBag makes the selected samples more representative, thereby improving the classification performance on the minority class and reducing the scale of the problem. © 2016, Science Press. All rights reserved.
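The procedure described in the abstract (cluster the majority class, weight samples by proximity to their cluster center, draw weighted majority-class subsets, train decision-tree sub-classifiers, and combine them by accuracy) can be sketched roughly as follows. This is a minimal illustration of the idea, not the authors' implementation; the weighting scheme `1/(1+distance)`, the accuracy-weighted vote, and the label convention (0 = majority, 1 = minority) are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def kacbag_fit(X, y, n_estimators=5, n_clusters=3, random_state=0):
    """Hypothetical sketch of KAcBag-style under-sampling with bagging.

    Majority-class samples close to their k-means cluster center get
    larger weights, so representative (central) samples are drawn more
    often when building each balanced training subset.
    """
    rng = np.random.default_rng(random_state)
    X_maj, X_min = X[y == 0], X[y == 1]  # assumption: 0 = majority class
    # Cluster the majority class; weight each sample by proximity to
    # its cluster center (assumed scheme: w = 1 / (1 + distance)).
    km = KMeans(n_clusters=n_clusters, n_init=10,
                random_state=random_state).fit(X_maj)
    dist = np.linalg.norm(X_maj - km.cluster_centers_[km.labels_], axis=1)
    w = 1.0 / (1.0 + dist)
    w /= w.sum()
    models, accs = [], []
    for _ in range(n_estimators):
        # Draw a majority subset the size of the minority class,
        # biased toward high-weight (central) samples.
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False, p=w)
        Xb = np.vstack([X_maj[idx], X_min])
        yb = np.hstack([np.zeros(len(idx)), np.ones(len(X_min))])
        clf = DecisionTreeClassifier(random_state=random_state).fit(Xb, yb)
        models.append(clf)
        accs.append(clf.score(Xb, yb))  # accuracy used to weight the tree
    return models, np.asarray(accs)

def kacbag_predict(models, accs, X):
    """Accuracy-weighted majority vote over the sub-classifiers."""
    votes = np.array([m.predict(X) for m in models])  # (n_estimators, n)
    score = (accs[:, None] * votes).sum(axis=0) / accs.sum()
    return (score >= 0.5).astype(int)
```

Drawing without replacement keeps each balanced subset diverse across the ensemble, while the weight bias keeps the retained majority samples representative, which is the stated goal of the method.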
Pages: 2613-2622 (9 pages)
References (23 in total)
[1]  
Xiao J., Xie L., He C., et al., Dynamic classifier ensemble model for customer classification with imbalanced class distribution, Expert Systems with Applications, 39, 3, pp. 3668-3675, (2012)
[2]  
Yu H., Ni J., Zhao J., ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data, Neurocomputing, 101, pp. 309-318, (2013)
[3]  
Wang S., Yao X., Using class imbalance learning for software defect prediction, IEEE Trans on Reliability, 62, 2, pp. 434-443, (2013)
[4]  
Ma Z., Yan R., Yuan D., et al., An imbalanced spam mail filtering method, International Journal of Multimedia and Ubiquitous Engineering, 10, 3, pp. 119-126, (2015)
[5]  
Kim H., Howland P., Park H., Dimension reduction in text classification with support vector machines, Journal of Machine Learning Research, 6, 1, pp. 37-53, (2005)
[6]  
Maes F., Vandermeulen D., Suetens P., Medical image registration using mutual information, Proceedings of the IEEE, 91, 10, pp. 1699-1722, (2003)
[7]  
Chawla N.V., Bowyer K.W., Hall L.O., et al., SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 1, pp. 321-357, (2002)
[8]  
Han H., Wang W.Y., Mao B.H., Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, Proc of the Int Conf on Intelligent Computing, pp. 878-887, (2005)
[9]  
Kubat M., Matwin S., Addressing the curse of imbalanced training sets: One-sided selection, Proc of the 14th Int Conf on Machine Learning, pp. 179-186, (1997)
[10]  
Yen S.J., Lee Y.S., Cluster based under-sampling approaches for imbalanced data distributions, Expert Systems with Applications, 36, 3, pp. 5718-5727, (2009)