TDMO: Dynamic multi-dimensional oversampling for exploring data distribution based on extreme gradient boosting learning

Cited by: 7
Authors
Jia, Liyan [1]
Wang, Zhiping [1]
Sun, Pengfei [1]
Xu, Zhaohui [2]
Yang, Sibo [1]
Affiliations
[1] Dalian Maritime Univ, Sch Sci, Dalian 116000, Peoples R China
[2] Dalian Med Univ, Affiliated Hosp 1, Clin Lab Dept, Dalian 116011, Peoples R China
Keywords
Class imbalance learning; Data distribution; Oversampling; k-nearest neighbors; SMOTE; RE-SAMPLING METHOD; CLASSIFICATION; MODEL; SVM
DOI
10.1016/j.ins.2023.119621
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The synthetic minority oversampling technique (SMOTE) is the most widely used general-purpose remedy for imbalanced data. Although SMOTE handles the class imbalance problem effectively in most cases, it makes insufficient use of the prior distribution of the data. In addition, most existing SMOTE variants generate new instances at random between a minority sample and its nearest neighbors, which risks propagating noise. To address these issues, this paper proposes TDMO, a novel approach to exploring data distributions that combines local distribution trust estimation based on extreme gradient boosting (XGBoost) with dynamic multi-dimensional oversampling. First, undersampling is used to build multiple balanced subsets, on which XGBoost models are trained to capture the internal structure of the original data and to obtain a classification prediction accuracy for each instance, called its confidence level (CL). Then, instances with low CL (i.e., noise) are filtered out, and the densities of the two classes in the neighborhood of each non-noise instance are evaluated to create candidate samples that expand the diversity of the minority class. Finally, the minority class is enhanced by combining multiple samples in a multi-dimensional feature space. Extensive experiments show that TDMO clearly outperformed the compared oversampling methods and achieved the best classification results.
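The first two stages described in the abstract (confidence levels from models trained on balanced subsets, then filtering low-confidence instances before generating minority samples) can be illustrated with a minimal sketch. This is not the authors' implementation. To keep it dependency-free, a nearest-centroid classifier stands in for XGBoost, the density evaluation and multi-dimensional combination steps are replaced by plain SMOTE-style interpolation, and every name and parameter (`confidence_levels`, `oversample`, `threshold`, `k`, `n_subsets`) is hypothetical. It also assumes binary labels with 1 as the minority class.

```python
import random


def centroid(rows):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def confidence_levels(X, y, n_subsets=10, seed=0):
    """Fraction of balanced-subset models that classify each instance correctly.

    Assumption: a nearest-centroid rule stands in for the XGBoost models
    named in the abstract; label 1 is the minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    c_min = centroid([X[i] for i in minority])
    hits = [0] * len(X)
    for _ in range(n_subsets):
        # Random undersampling: all minority rows plus an equally sized majority draw.
        drawn = rng.sample(majority, len(minority))
        c_maj = centroid([X[i] for i in drawn])
        for i, row in enumerate(X):
            pred = 1 if sq_dist(row, c_min) < sq_dist(row, c_maj) else 0
            hits[i] += int(pred == y[i])
    return [h / n_subsets for h in hits]


def oversample(X, y, cl, threshold=0.5, k=2, n_new=4, seed=0):
    """Drop low-confidence minority rows, then interpolate SMOTE-style.

    Assumption: plain interpolation between a kept minority row and one of
    its k nearest kept neighbours, not the paper's multi-sample combination.
    """
    rng = random.Random(seed)
    kept = [X[i] for i in range(len(X)) if y[i] == 1 and cl[i] >= threshold]
    if len(kept) < 2:
        return []  # not enough trusted minority rows to interpolate
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(kept)
        others = [p for p in kept if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        mate = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy data: six majority rows near the origin, three minority rows near (5, 5),
# and one minority-labelled row sitting inside the majority cluster (noise).
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4], [0.4, 0.1],
     [5.0, 5.0], [5.2, 5.1], [5.1, 4.8], [0.1, 0.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

cl = confidence_levels(X, y)
new_rows = oversample(X, y, cl)
```

In this toy setup the noisy minority row is misclassified by every subset model, so it is excluded before interpolation and the synthetic rows stay inside the clean minority cluster. With a real gradient-boosted model, the confidence level would typically be estimated on rows held out from each subset rather than on the training rows themselves.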
Pages: 36