Cluster-based under-sampling approaches for imbalanced data distributions

Cited by: 483
Authors
Yen, Show-Jane
Lee, Yue-Shi
Affiliation
[1] Department of Computer Science and Information Engineering, Ming Chuan University, Gwei Shan District, Taoyuan County 333
Keywords
Classification; Data mining; Under-sampling; Imbalanced data distribution;
DOI
10.1016/j.eswa.2008.06.108
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
For classification problems, the training data significantly influence classification accuracy. However, data in real-world applications often have an imbalanced class distribution, that is, most of the data are in the majority class and few are in the minority class. In this case, if all the data are used as training data, the classifier tends to predict that most incoming data belong to the majority class. Hence, it is important to select suitable training data for classification in the imbalanced class distribution problem. In this paper, we propose cluster-based under-sampling approaches for selecting representative data as training data, to improve classification accuracy for the minority class, and we investigate the effect of under-sampling methods in the imbalanced class distribution environment. The experimental results show that our cluster-based under-sampling approaches outperform the other under-sampling techniques from previous studies. (C) 2008 Elsevier Ltd. All rights reserved.
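The abstract does not spell out the selection rule, so the following is only a minimal sketch of the general idea of cluster-based under-sampling, not the authors' algorithm. It assumes that the whole training set is clustered with plain k-means and that majority-class samples are then drawn at random from each cluster in proportion to that cluster's share of the majority class, so that roughly as many majority samples as minority samples are kept. The names `kmeans` and `cluster_undersample`, the parameter `k`, and the proportional quota are all illustrative assumptions.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def cluster_undersample(points, labels, minority, k=3, seed=0):
    """Return sorted indices of the samples kept for training.

    Assumption (not taken from the paper): every minority sample is kept, and
    each cluster contributes majority samples in proportion to its share of
    the majority class. The data must contain at least one majority sample.
    """
    rng = random.Random(seed)
    assign = kmeans(points, k, seed=seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    keep = list(minority_idx)
    for c in range(k):
        in_cluster = [i for i in majority_idx if assign[i] == c]
        # Target about len(minority_idx) majority samples in total.
        quota = round(len(minority_idx) * len(in_cluster) / len(majority_idx))
        keep.extend(rng.sample(in_cluster, min(quota, len(in_cluster))))
    return sorted(keep)
```

Because each cluster's quota is rounded separately, the number of majority samples kept can differ from the minority count by one or two. A classifier would then be trained only on the rows whose indices are returned.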
Pages: 5718-5727
Number of pages: 10
Related papers
50 in total
  • [31] An Improved Under-sampling Imbalanced Classification Algorithm
    Yao, Baofeng
    Wang, Lei
    2021 13TH INTERNATIONAL CONFERENCE ON MEASURING TECHNOLOGY AND MECHATRONICS AUTOMATION (ICMTMA 2021), 2021, : 775 - 779
  • [32] Cluster-Based Minority Over-Sampling for Imbalanced Datasets
    Puntumapon, Kamthorn
    Rakthamamon, Thanawin
    Waiyamai, Kitsana
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2016, E99D (12): : 3101 - 3109
  • [33] An Under-sampling Method Based on Fuzzy Logic for Large Imbalanced Dataset
    Wong, Ginny Y.
    Leung, Frank H. F.
    Ling, Sai-Ho
    2014 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS (FUZZ-IEEE), 2014, : 1248 - 1252
  • [34] A binary PSO-based ensemble under-sampling model for rebalancing imbalanced training data
    Jinyan Li
    Yaoyang Wu
    Simon Fong
    Antonio J. Tallón-Ballesteros
    Xin-she Yang
    Sabah Mohammed
    Feng Wu
    The Journal of Supercomputing, 2022, 78 : 7428 - 7463
  • [35] An Imbalanced Multi-Label Data Ensemble Learning Method Based on Safe Under-Sampling
    Sun, Zhong-Bin
    Diao, Yu-Xuan
    Ma, Su-Yang
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2024, 52 (10): : 3392 - 3408
  • [36] A binary PSO-based ensemble under-sampling model for rebalancing imbalanced training data
    Li, Jinyan
    Wu, Yaoyang
    Fong, Simon
    Tallon-Ballesteros, Antonio J.
    Yang, Xin-she
    Mohammed, Sabah
    Wu, Feng
    JOURNAL OF SUPERCOMPUTING, 2022, 78 (05): : 7428 - 7463
  • [37] Several SVM Ensemble Methods Integrated with Under-Sampling for Imbalanced Data Learning
    Lin, ZhiYong
    Hao, ZhiFeng
    Yang, XiaoWei
    Liu, XiaoLan
    ADVANCED DATA MINING AND APPLICATIONS, PROCEEDINGS, 2009, 5678 : 536 - +
  • [38] Cluster-Based Instance Selection for the Imbalanced Data Classification
    Czarnowski, Ireneusz
    Jedrzejowicz, Piotr
    COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2018, PT II, 2018, 11056 : 191 - 200
  • [39] A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data
    Amir Reza Salehi
    Majid Khedmati
    Scientific Reports, 14
  • [40] A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data
    Salehi, Amir Reza
    Khedmati, Majid
    SCIENTIFIC REPORTS, 2024, 14 (01)