Self-adaptive oversampling method based on the complexity of minority data in imbalanced datasets classification

Cited by: 11
Authors
Tao, Xinmin [1]
Guo, Xinyue [2]
Zheng, Yujia [1]
Zhang, Xiaohan [1]
Chen, Zhiyu [1]
Affiliations
[1] Northeast Forestry Univ, Coll Civil Engn & Transportat, 26 Hexing Rd, Harbin 150040, Heilongjiang, Peoples R China
[2] Northeast Forestry Univ, Coll Mech & Elect Engn, Harbin 150040, Heilongjiang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Imbalanced datasets; Oversampling; Classification; Overlapping; Within-class imbalance; OVER-SAMPLING TECHNIQUE; SMOTE; NOISY;
DOI
10.1016/j.knosys.2023.110795
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Learning from imbalanced datasets is a nontrivial task for the supervised learning community. Traditional classifiers may have difficulty learning the concept related to the minority class in imbalanced classification, and the problem worsens in the presence of other complicating factors such as overlapping, outliers, and small disjuncts. In this paper, we propose a self-adaptive oversampling algorithm based on the complexity of the minority data for imbalanced classification problems. In the proposed algorithm, hyperspheres with different radii, determined by the imbalance ratio and the distances to the nearest enemy neighbors, are first generated to cover all minority instances, subject to the constraint that no hypersphere contains a majority instance. Oversampling is then conducted only within these hyperspheres, so the generated synthetic minority instances cannot intrude into the majority space, which avoids overlapping while between-class balance is being achieved. In addition, a self-adaptive strategy for assigning oversampling sizes is developed based on the minority data complexity: hyperspheres with small radii that contain few instances are given more chances to be oversampled. This strategy helps address outliers and small disjuncts, since the hyperspheres covering them are usually small and contain few instances; they therefore get more chances to generate synthetic instances, eliminating the within-class imbalance caused by lack of density. Moreover, since the hyperspheres covering boundary minority instances are relatively small and are thus assigned larger oversampling sizes, the proposed approach also strengthens the boundary information of the minority class, which benefits subsequent learning tasks.
Extensive experimental results on various simulated and real-world imbalanced datasets show that the proposed method significantly outperforms other state-of-the-art oversampling methods. © 2023 Elsevier B.V. All rights reserved.
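The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the radius rule or the size-assignment formula, so the `shrink` factor, the weight `1 / (radius * count)`, and the function name `hypersphere_oversample` are all assumptions made for illustration. The sketch only shows the two stated ideas: spheres that exclude every majority instance, and more synthetic samples for small, sparse spheres.

```python
import numpy as np


def hypersphere_oversample(X_min, X_maj, n_new, shrink=0.99, seed=0):
    """Illustrative sketch: draw synthetic minority points inside
    majority-free hyperspheres centered on minority instances."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    dim = X_min.shape[1]

    # Radius: slightly less than the distance to the nearest majority
    # ("enemy") neighbor, so no majority instance lies inside the sphere.
    # ASSUMPTION: the paper also uses the imbalance ratio here; that rule
    # is not given in the abstract, so a fixed shrink factor stands in.
    d_enemy = np.linalg.norm(X_min[:, None, :] - X_maj[None, :, :], axis=2)
    radii = shrink * d_enemy.min(axis=1)

    # Count the minority instances each sphere covers (center included).
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    counts = (d_min <= radii[:, None]).sum(axis=1)

    # ASSUMPTION (hypothetical weighting): small spheres with few
    # instances receive a larger share of the synthetic samples.
    weights = 1.0 / (radii * counts + 1e-12)
    probs = weights / weights.sum()

    # Choose a sphere for each synthetic point, then sample uniformly
    # inside it: random unit direction, radius scaled by u ** (1 / dim).
    centers = rng.choice(len(X_min), size=n_new, p=probs)
    directions = rng.normal(size=(n_new, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    scale = radii[centers] * rng.random(n_new) ** (1.0 / dim)
    X_new = X_min[centers] + directions * scale[:, None]
    return X_new, radii, probs
```

Because each synthetic point lies within a sphere whose radius is smaller than the distance from its center to the nearest majority instance, the triangle inequality guarantees it cannot coincide with a majority instance. Different spheres may still overlap each other, which the sketch does not try to prevent.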
Pages: 23
Related papers
50 entries in total
  • [41] A novel oversampling method based on SeqGAN for imbalanced text classification
    Luo, Yin
    Weng, Xuanlong
    Zheng, Huang
    Feng, Haishan
    Luang, Ke
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019: 2891 - 2894
  • [42] MI-MOTE: Multiple imputation-based minority oversampling technique for imbalanced and incomplete data classification
    Shin, Kyoham
    Han, Jongmin
    Kang, Seokho
    INFORMATION SCIENCES, 2021, 575 : 80 - 89
  • [43] Assessing the data complexity of imbalanced datasets
    Barella, Victor H.
    Garcia, Luis P. F.
    de Souto, Marcilio C. P.
    Lorena, Ana C.
    de Carvalho, Andre C. P. L. F.
    INFORMATION SCIENCES, 2021, 553 : 83 - 109
  • [44] An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets
    Kovacs, Gyorgy
    APPLIED SOFT COMPUTING, 2019, 83
  • [45] Fuzzy-synthetic minority oversampling technique: Oversampling based on fuzzy set theory for Android malware detection in imbalanced datasets
    Xu, Yanping
    Wu, Chunhua
    Zheng, Kangfeng
    Niu, Xinxin
    Yang, Yixian
    INTERNATIONAL JOURNAL OF DISTRIBUTED SENSOR NETWORKS, 2017, 13 (04)
  • [47] Imbalanced biomedical data classification using self-adaptive multilayer ELM combined with dynamic GAN
    Zhang, Liyuan
    Yang, Huamin
    Jiang, Zhengang
    BIOMEDICAL ENGINEERING ONLINE, 2018, 17
  • [48] A novel synthetic minority oversampling technique based on relative and absolute densities for imbalanced classification
    Liu, Ruijuan
    APPLIED INTELLIGENCE, 2023, 53 (01) : 786 - 803
  • [50] CDSMOTE: class decomposition and synthetic minority class oversampling technique for imbalanced-data classification
    Elyan, Eyad
    Moreno-Garcia, Carlos Francisco
    Jayne, Chrisina
    NEURAL COMPUTING & APPLICATIONS, 2021, 33 (07): 2839 - 2851