Clustering boundary over-sampling classification method for imbalanced data sets

Cited: 2
Authors
Lou, Xiao-Jun [1 ]
Sun, Yu-Xuan [1 ]
Liu, Hai-Tao [1 ,2 ]
Affiliations
[1] Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences
[2] Wuxi SensingNet Industrialization Research Institute
Source
Liu, H.-T. (liuhaitao@wsn.cn) | Zhejiang University, Vol. 47 (2013)
Keywords
Clustering boundary; Imbalanced data sets; K-nearest density; Over-sampling; Synthetic samples;
DOI
10.3785/j.issn.1008-973X.2013.06.003
CLC number
Subject classification code
Abstract
The synthetic minority over-sampling technique (SMOTE) is widely used for imbalanced data classification. However, SMOTE synthesizes new samples without any guidance, which can make the resulting classifier noise-sensitive and prone to over-fitting. To address this problem, a novel over-sampling classification method for imbalanced data sets, called cluster boundary synthetic minority over-sampling technique (CB-SMOTE), was proposed. A clustering consistency index was introduced to identify the boundary samples of the minority class. A k-nearest density was then defined to determine how many new samples to synthesize and to reject noise samples, and the rule for synthesizing new samples was modified accordingly. CB-SMOTE is thus a guided over-sampling method, and the samples it generates are more beneficial for classifier learning. Six classification methods were compared on University of California Irvine (UCI) data sets. Experimental results show that the proposed method outperforms the other methods on both the minority and majority classes, and that it is more stable across different over-sampling rates.
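The abstract outlines the CB-SMOTE pipeline (locate boundary minority samples via a clustering consistency index, weight them by a k-nearest density that also rejects noise, then synthesize new samples) but gives none of the exact formulas. The sketch below is therefore only an illustration of a boundary-guided SMOTE variant under assumed heuristics: the helper boundary_oversample, its parameters, and the majority-fraction boundary score are hypothetical stand-ins, not the authors' definitions.

    # Illustrative sketch only: a boundary-guided SMOTE variant with an assumed
    # k-NN majority-fraction heuristic standing in for the paper's clustering
    # consistency index and k-nearest density (definitions not in the abstract).
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def boundary_oversample(X_min, X_maj, k=5, n_new=100, seed=0):
        """Synthesize minority samples near the class boundary (hypothetical helper)."""
        rng = np.random.default_rng(seed)
        X_all = np.vstack([X_min, X_maj])

        # Boundary score: fraction of majority points among the k nearest
        # neighbours of each minority sample (column 0 is normally the sample itself).
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_all).kneighbors(X_min)
        maj_frac = (idx[:, 1:] >= len(X_min)).mean(axis=1)

        # Keep minority samples with mixed neighbourhoods; treat samples whose
        # neighbours are all majority as noise and reject them.
        boundary = X_min[(maj_frac > 0) & (maj_frac < 1)]
        if len(boundary) == 0:
            boundary = X_min  # degenerate fallback: no clear boundary found

        # Classic SMOTE interpolation, restricted to the boundary samples:
        # new = x + gap * (minority neighbour - x) for a random gap in [0, 1).
        _, min_idx = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(
            X_min).kneighbors(boundary)
        new = np.empty((n_new, X_min.shape[1]))
        for t in range(n_new):
            i = rng.integers(len(boundary))
            j = min_idx[i, rng.integers(1, min_idx.shape[1])]  # skip self at column 0
            new[t] = boundary[i] + rng.random() * (X_min[j] - boundary[i])
        return new

On a toy data set with, say, 30 minority and 300 majority points, boundary_oversample(X_min, X_maj, k=5, n_new=270) would roughly balance the classes before any of the compared classifiers is trained.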
Pages: 944-950
Page count: 6
References
23 references in total
  • [1] Gu Q., Cai Z.-H., Zhu L., Et al., Data mining on imbalanced data sets, Proceedings of International Conference on Advanced Computer Theory and Engineering (ICACTE'08), pp. 1020-1024, (2008)
  • [2] Lin Z.-Y., Hao Z.-F., Yang X.-W., Current state of research on imbalanced data sets classification learning, Application Research of Computers, 25, 2, pp. 332-336, (2008)
  • [3] Ye Z.-F., Wen Y.-M., Lv B.-L., A survey of imbalanced pattern classification problems, CAAI Transactions on Intelligent Systems, 4, 2, pp. 148-156, (2009)
  • [4] He H.-B., Garcia E.A., Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering, 21, 9, pp. 1263-1284, (2009)
  • [5] Miao Z.-M., Zhao L.-W., Yuan W.-W., Et al., Multi-class imbalanced learning implemented in network intrusion, Proceedings of International Conference on Computer Science and Service System (CSSS'11), pp. 1395-1398, (2011)
  • [6] Liu Y.-Q., Wang C., Zhang L., Decision tree based predictive models for breast cancer survivability on imbalanced data, Proceedings of the 3rd International Conference on Bioinformatics and Biomedical Engineering (ICBBE'09), pp. 1-4, (2009)
  • [7] Estabrooks A., Jo T., Japkowicz N., A multiple resampling method for learning from imbalanced data sets, Computational Intelligence, 20, 1, pp. 18-36, (2004)
  • [8] Zhai Y., Ma N., Ruan D., Et al., An effective over-sampling method for imbalanced data sets classification, Chinese Journal of Electronics, 20, 3, pp. 489-494, (2011)
  • [9] Yen S.J., Lee Y.S., Cluster-based under-sampling approaches for imbalanced data distributions, Expert Systems with Applications, 36, 3, pp. 5718-5727, (2009)
  • [10] Garcia V., Sanchez J.S., Mollineda R.A., On the effectiveness of preprocessing methods when dealing with different levels of class imbalance, Knowledge-Based Systems, 25, 1, pp. 13-21, (2012)