An approach for classification of highly imbalanced data using weighting and undersampling

Cited by: 129
Authors
Anand, Ashish [1]
Pugalenthi, Ganesan [1]
Fogel, Gary B. [2]
Suganthan, P. N. [1]
Affiliations
[1] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore
[2] Nat Select Inc, San Diego, CA 92121 USA
Keywords
Imbalanced datasets; SVM; Undersampling technique; PROTEIN; PREDICTION; RESIDUES; SEQUENCE; SITES; IDENTIFICATION; CLASSIFIERS;
DOI
10.1007/s00726-010-0595-2
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Real-world datasets commonly have issues with data imbalance. There are several approaches such as weighting, sub-sampling, and data modeling for handling these data. Learning in the presence of data imbalances presents a great challenge to machine learning. Techniques such as support-vector machines have excellent performance for balanced data, but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated in the context of several real, imbalanced biological datasets. The ratios of negative to positive samples vary from ~9:1 to ~100:1. Useful classifiers have high sensitivity and specificity. Our results demonstrate that the proposed selection technique improves the sensitivity compared to weighted support-vector machine and available results in the literature for the same datasets.
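The abstract does not describe the proposed selection technique, so the following is only a minimal sketch of the two generic ideas it names, class weighting and undersampling of the majority class, together with the sensitivity and specificity measures. It assumes plain random undersampling and an inverse-frequency weighting heuristic; neither is the authors' method, and all function names are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def random_undersample(samples, labels, ratio=1.0, seed=0):
    """Keep every positive (label 1) and a random subset of negatives (label 0).

    `ratio` is the desired number of negatives per positive after sampling.
    This is plain random selection, not the paper's selection technique.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data with a 9:1 negative-to-positive ratio (illustrative only).
labels = [1] * 10 + [0] * 90
samples = list(range(100))

weights = inverse_frequency_weights(labels)
balanced_samples, balanced_labels = random_undersample(samples, labels, ratio=1.0)

# A classifier that always predicts the majority class is 90% accurate here,
# yet it detects no positives at all.
always_negative = [0] * len(labels)
sens, spec = sensitivity_specificity(labels, always_negative)
```

On this toy data the always-negative predictor reaches a specificity of 1.0 but a sensitivity of 0.0, which is why the abstract asks for both measures to be high rather than reporting accuracy alone.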
Pages: 1385-1391
Number of pages: 7
Related papers
50 in total
  • [1] An approach for classification of highly imbalanced data using weighting and undersampling
    Ashish Anand
    Ganesan Pugalenthi
    Gary B. Fogel
    P. N. Suganthan
    Amino Acids, 2010, 39 : 1385 - 1391
  • [2] Evolutionary Undersampling for Imbalanced Big Data Classification
    Triguero, I.
    Galar, M.
    Vluymans, S.
    Cornelis, C.
    Bustince, H.
    Herrera, F.
    Saeys, Y.
    2015 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2015, : 715 - 722
  • [3] Radial-Based Undersampling for imbalanced data classification
    Koziarski, Michal
    PATTERN RECOGNITION, 2020, 102
  • [4] Relevant information undersampling to support imbalanced data classification
    Hoyos-Osorio, J.
    Alvarez-Meza, A.
    Daza-Santacoloma, G.
    Orozco-Gutierrez, A.
    Castellanos-Dominguez, G.
    NEUROCOMPUTING, 2021, 436 : 136 - 146
  • [5] Overlap-Based Undersampling for Improving Imbalanced Data Classification
    Vuttipittayamongkol, Pattaramon
    Elyan, Eyad
    Petrovski, Andrei
    Jayne, Chrisina
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2018, PT I, 2018, 11314 : 689 - 697
  • [6] Imbalanced text classification: A term weighting approach
    Liu, Ying
    Loh, Han Tong
    Sun, Aixin
    EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (01) : 690 - 701
  • [7] UFIDSF: An undersampling approach based on feature importance and double side filter for imbalanced data classification
    Zheng, Ming
    Wang, Fei
    Hu, Xiaowen
    Hu, Liangchen
    Yu, Qingying
    Zheng, Xiaoyao
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2025, 167
  • [8] An Iterative Undersampling of Extremely Imbalanced Data Using CSVM
    Lee, Jong Bum
    Lee, Jee-Hyong
    SEVENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2014), 2015, 9445
  • [9] PSU: Particle Stacking Undersampling Method for Highly Imbalanced Big Data
    Jeon, Yong-Seok
    Lim, Dong-Joon
    IEEE ACCESS, 2020, 8 : 131920 - 131927
  • [10] Undersampling with Support Vectors for Multi-Class Imbalanced Data Classification
    Krawczyk, Bartosz
    Bellinger, Colin
    Corizzo, Roberto
    Japkowicz, Nathalie
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,