Imbalanced data optimization combining K-means and SMOTE

Cited by: 2
Authors
Li W. [1]
Affiliations
[1] Hebei Vocational and Technical College of Building Materials, Qinhuangdao
Source
International Journal of Performability Engineering | 2019 / Vol. 15 / No. 08
Keywords
Classification; Imbalanced data; K-Means; Random forest; SMOTE
DOI
10.23940/ijpe.19.08.p17.21732181
Abstract
Imbalanced data arises in many applications, such as credit card fraud identification, network intrusion detection, cancer detection, commodity recommendation, software defect prediction, and customer churn prediction, and has become an active research topic. Two problems arise when classifying imbalanced data sets: the random forest algorithm classifies negative class samples with low accuracy, and the SMOTE algorithm suffers from marginalization when selecting new samples. To address both, a new algorithm, KMS_SMOTE, is proposed. To avoid the marginalization of new samples, the K-Means algorithm is used to cluster the negative class samples and obtain their centroid, and the new data set is then built by selecting samples near the centroid. To verify its effect, KMS_SMOTE is compared with the SMOTE algorithm on data sets from the UCI Machine Learning Repository. The experimental results show that KMS_SMOTE effectively improves the classification performance of the random forest algorithm on imbalanced data sets. © 2019 Totem Publisher, Inc. All rights reserved.
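The centroid-guided sampling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's KMS_SMOTE implementation: the abstract does not give the exact procedure, so the function names (`kmeans`, `centroid_guided_oversample`), the `keep_ratio` parameter, and the choice to interpolate each selected sample toward its cluster centroid are all assumptions made for illustration. The `minority` argument stands for the class being oversampled, which the abstract calls the negative class.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-Means; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct samples as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return centers, labels


def centroid_guided_oversample(minority, n_new, k=1, keep_ratio=0.5, seed=0):
    """Generate n_new synthetic samples from points close to cluster centroids.

    Assumption (hypothetical, not from the paper): only the keep_ratio
    fraction of each cluster nearest its centroid is used as seed samples,
    and each synthetic point is a SMOTE-style interpolation
    seed + gap * (centroid - seed) with gap drawn uniformly from [0, 1).
    """
    rng = random.Random(seed)
    centers, labels = kmeans(minority, k, seed=seed)
    clusters = []
    for j, c in enumerate(centers):
        members = [p for p, lab in zip(minority, labels) if lab == j]
        if not members:
            continue
        members.sort(key=lambda p: math.dist(p, c))
        n_keep = max(1, int(len(members) * keep_ratio))
        clusters.append((c, members[:n_keep]))  # drop the marginal samples
    synthetic = []
    for _ in range(n_new):
        c, seeds = rng.choice(clusters)
        s = rng.choice(seeds)
        gap = rng.random()
        synthetic.append([si + gap * (ci - si) for si, ci in zip(s, c)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
new_samples = centroid_guided_oversample(minority, n_new=10, k=1, keep_ratio=0.8)
print(len(new_samples))
```

Because every synthetic point lies on a segment between a retained sample and its centroid, the outlying sample at (5, 5) never contributes, which is the kind of marginal sample the abstract aims to avoid. Standard SMOTE instead interpolates between a sample and one of its nearest minority-class neighbors, so an outlier can seed new points at the class margin.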
Pages: 2173-2181
Page count: 8