A novel progressively undersampling method based on the density peaks sequence for imbalanced data

Cited by: 60
Authors
Xie, Xiaoying [1]
Liu, Huawen [2]
Zeng, Shouzhen [3]
Lin, Lingbin [4]
Li, Wen [5]
Affiliations
[1] Zhejiang Normal Univ, Coll Econ & Management, Jinhua 321004, Zhejiang, Peoples R China
[2] Zhejiang Normal Univ, Coll Math & Comp Sci, Jinhua 321004, Zhejiang, Peoples R China
[3] Ningbo Univ, Sch Business, Ningbo 315211, Peoples R China
[4] Zhejiang Normal Univ, Student Management Off, Jinhua 321004, Zhejiang, Peoples R China
[5] Curtin Univ, Dept Math & Stat, Perth, WA 6845, Australia
Keywords
Progressive undersampling; Density peaks sequence; Importance degree; Optimal undersampling size; Imbalanced data;
DOI
10.1016/j.knosys.2020.106689
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Undersampling is a widely used resampling technique for imbalanced data. Traditional undersampling techniques typically reduce the majority classes to the same scale as the minority classes and therefore tend to discard valuable information, so many strategies, such as clustering, have been developed. However, two essential problems remain open: which instances should be extracted in undersampling, and how many. To alleviate these two problems, this paper proposes a novel undersampling method for imbalanced data. It exploits a sequence of density peaks to progressively extract instances from the majority classes. Specifically, two factors are introduced to measure the importance degree of each majority-class instance. With these two factors, a sampling sequence is generated based on the importance of instances for classification. The optimal undersampling size of the majority classes is then determined automatically by progressively extracting the important instances from the sequence. To evaluate the effectiveness of the proposed method, a series of experiments comparing it with six popular undersampling methods was conducted on 40 public benchmark datasets. The experimental results show that the proposed method outperforms the state-of-the-art undersampling methods it was compared with. © 2020 Elsevier B.V. All rights reserved.
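The abstract does not define the two importance factors or the stopping criterion, so the following is only a rough sketch of the general idea, not the authors' algorithm. It assumes the two factors are the local density `rho` and the distance `delta` to the nearest denser point from standard density-peaks clustering, combined as `rho * delta`. The function names, the cutoff `dc`, and the user-supplied `score_fn` are all hypothetical.

```python
import numpy as np


def density_peaks_sequence(X, dc=1.0):
    """Order the rows of X by decreasing rho * delta.

    Assumption: rho and delta are the standard density-peaks quantities;
    the paper's own two factors may be defined differently.
    """
    # Pairwise Euclidean distances between all instances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Gaussian-kernel local density, excluding each point's own contribution.
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        # Distance to the nearest denser point; the densest point
        # falls back to its largest distance to any other point.
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    return np.argsort(-(rho * delta), kind="stable")


def progressive_undersample(X_maj, X_min, score_fn, dc=1.0, step=1):
    """Grow the majority subset along the sequence and keep the best size.

    score_fn(X_maj_subset, X_min) is a hypothetical callback that returns
    a quality score (higher is better), e.g. a cross-validated metric.
    """
    order = density_peaks_sequence(X_maj, dc)
    best_k, best_score = None, -np.inf
    for k in range(step, len(order) + 1, step):
        score = score_fn(X_maj[order[:k]], X_min)
        if score > best_score:  # keep the smallest size that scores best
            best_k, best_score = k, score
    return order[:best_k], best_score
```

In practice `score_fn` would train a classifier on the candidate subset plus the minority class and return a validation metric such as F1 or G-mean. The pairwise distance matrix makes this sketch quadratic in the number of majority instances, so it is only suitable for small datasets.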
Pages: 11