Safe sample screening based sampling method for imbalanced data

Cited by: 0
Authors
Shi H. [1]
Liu Y. [1]
Ji S. [1]
Affiliation
[1] College of Information, Shanxi University of Finance and Economics, Taiyuan
Source
Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence | 2019, Vol. 32, No. 06
Funding
National Natural Science Foundation of China
Keywords
Imbalanced Data; Safe Sample Screening; Synthetic Minority Oversampling Technique (SMOTE); Undersampling; Imbalance Ratio
DOI
10.16451/j.cnki.issn1003-6059.201906007
Abstract
Undersampling can discard valuable information, and the synthetic minority oversampling technique (SMOTE) can aggravate the overlap between the majority class and the minority class. This paper proposes a sampling method, Screening_SMOTE, that combines safe sample screening based undersampling with SMOTE. The undersampling step uses safe screening rules to identify and discard some of the non-informative instances and noise instances in the majority class. The minority class instances generated by SMOTE are then added to the screened dataset. Because safe sample screening based undersampling retains informative instances while discarding noise instances in the majority class, the loss of useful information is avoided and class overlap is relieved. The experimental results show that Screening_SMOTE is an effective method for rebalancing imbalanced datasets, especially high-dimensional imbalanced datasets. © 2019, Science Press. All rights reserved.
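The two-stage pipeline described in the abstract (screen the majority class, then oversample the minority class) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record does not give the safe screening rules, so the hypothetical `knn_screen` function below substitutes a plain k-nearest-neighbour heuristic for them, and `smote` is a bare NumPy version of SMOTE interpolation. Labels are assumed to be binary, with 1 marking the minority class.

```python
import numpy as np


def knn_screen(X_maj, X_min, k=5):
    """Stand-in for safe sample screening (ASSUMPTION: kNN heuristic,
    not the paper's rule). Returns a boolean mask over X_maj that keeps
    only majority points with a mixed neighbourhood."""
    X_all = np.vstack([X_maj, X_min])
    is_min = np.concatenate([np.zeros(len(X_maj), dtype=bool),
                             np.ones(len(X_min), dtype=bool)])
    # Squared distances from every majority point to every point.
    d = ((X_maj[:, None, :] - X_all[None, :, :]) ** 2).sum(axis=-1)
    idx = np.arange(len(X_maj))
    d[idx, idx] = np.inf  # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    m = is_min[nn].sum(axis=1)  # minority neighbours per majority point
    # m == k: surrounded by minority points, treated as noise.
    # m == 0: deep inside the majority region, treated as non-informative.
    return (m > 0) & (m < k)


def smote(X_min, n_new, k=5, seed=None):
    """Create n_new synthetic minority points, each interpolated between
    a minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


def screening_smote_sketch(X, y, k=5, seed=0):
    """Screen the majority class (label 0), then add SMOTE samples to the
    minority class (label 1) until the two classes are the same size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    X_maj = X_maj[knn_screen(X_maj, X_min, k)]
    n_new = max(len(X_maj) - len(X_min), 0)
    X_syn = smote(X_min, n_new, k, seed)
    X_out = np.vstack([X_maj, X_min, X_syn])
    y_out = np.concatenate([np.zeros(len(X_maj), dtype=int),
                            np.ones(len(X_min) + n_new, dtype=int)])
    return X_out, y_out
```

Screening before oversampling means synthetic minority points are generated against a smaller majority set, which is the ordering the abstract describes. The balancing target (equal class sizes) and the value of `k` are choices made for this sketch only.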
Pages: 545-556
Page count: 11