An approach for classification of highly imbalanced data using weighting and undersampling

Cited by: 137
Authors
Anand, Ashish [1 ]
Pugalenthi, Ganesan [1 ]
Fogel, Gary B. [2 ]
Suganthan, P. N. [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore
[2] Nat Select Inc, San Diego, CA 92121 USA
Keywords
Imbalanced datasets; SVM; Undersampling technique; PROTEIN; PREDICTION; RESIDUES; SEQUENCE; SITES; IDENTIFICATION; CLASSIFIERS;
DOI
10.1007/s00726-010-0595-2
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Real-world datasets commonly have issues with data imbalance. There are several approaches such as weighting, sub-sampling, and data modeling for handling these data. Learning in the presence of data imbalances presents a great challenge to machine learning. Techniques such as support-vector machines have excellent performance for balanced data, but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated in the context of several real biological imbalanced data. The ratios of negative to positive samples vary from ~9:1 to ~100:1. Useful classifiers have high sensitivity and specificity. Our results demonstrate that the proposed selection technique improves the sensitivity compared to weighted support-vector machine and available results in the literature for the same datasets.
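As a rough illustration of the terms used in the abstract, the sketch below shows plain random undersampling of the majority (negative) class together with the sensitivity and specificity calculation. This is not the selection technique proposed in the paper, which the abstract does not describe; random undersampling is swapped in as a generic baseline. The function names `random_undersample` and `sensitivity_specificity`, the `ratio` parameter, and the label encoding (1 = positive, 0 = negative) are assumptions made for this example.

```python
import random


def random_undersample(samples, labels, ratio=1.0, seed=0):
    """Keep every positive (label 1) and a random subset of negatives (label 0).

    `ratio` is the target number of negatives per positive after sampling.
    Generic baseline only; NOT the selection method proposed in the paper.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than are available.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


# Toy data with a 9:1 negative-to-positive ratio (hypothetical values).
labels = [1] * 10 + [0] * 90
samples = list(range(100))
balanced_x, balanced_y = random_undersample(samples, labels, ratio=1.0)

# A classifier that always predicts "negative" has perfect specificity
# but zero sensitivity, which is why both measures are reported.
always_negative = [0] * len(labels)
sens, spec = sensitivity_specificity(labels, always_negative)
```

On imbalanced data the always-negative predictor in the last lines would still reach 90% accuracy, which is why sensitivity is the more informative measure for the minority class.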
Pages: 1385-1391
Page count: 7
Related papers
50 in total
[31]   Automatic Classification of Lithofacies with Highly Imbalanced Dataset Using Multistage SVM Classifier [J].
Datta, Deepan ;
Singh, Gagandeep ;
Routray, Aurobinda ;
Mohanty, William K. ;
Mahadik, Rahul .
IECON 2021 - 47TH ANNUAL CONFERENCE OF THE IEEE INDUSTRIAL ELECTRONICS SOCIETY, 2021,
[32]   Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling [J].
Julián Luengo ;
Alberto Fernández ;
Salvador García ;
Francisco Herrera .
Soft Computing, 2011, 15 :1909-1936
[33]   LSTMAE-DWSSLM: A unified approach for imbalanced time series data classification [J].
Liu, Jingjing ;
Yao, Jiepeng ;
Zhou, Qiao ;
Wang, Zhongyi ;
Huang, Lan .
APPLIED INTELLIGENCE, 2023, 53 (18) :21077-21091
[34]   A Novel Imbalanced Data Classification Approach Based on Logistic Regression and Fisher Discriminant [J].
Shi, Baofeng ;
Wang, Jing ;
Qi, Junyan ;
Cheng, Yanqiu .
MATHEMATICAL PROBLEMS IN ENGINEERING, 2015, 2015
[35]   A novel imbalanced data classification approach for suicidal ideation detection on social media [J].
Mohamed Ali Ben Hassine ;
Safa Abdellatif ;
Sadok Ben Yahia .
Computing, 2022, 104 :741-765
[36]   A Hybrid Active-Passive Approach to Imbalanced Nonstationary Data Stream Classification [J].
Malialis, Kleanthis ;
Roveri, Manuel ;
Alippi, Cesare ;
Panayiotou, Christos G. ;
Polycarpou, Marios M. .
2022 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2022, :1021-1027
[37]   An Optimized k-NN Approach for Classification on Imbalanced Datasets with Missing Data [J].
Ozan, Ezgi Can ;
Riabchenko, Ekaterina ;
Kiranyaz, Serkan ;
Gabbouj, Moncef .
ADVANCES IN INTELLIGENT DATA ANALYSIS XV, 2016, 9897 :387-392
[38]   Lopinavir Resistance Classification with Imbalanced Data Using Probabilistic Neural Networks [J].
Raposo, Leticia M. ;
Arruda, Monica B. ;
de Brindeiro, Rodrigo M. ;
Nobre, Flavio F. .
JOURNAL OF MEDICAL SYSTEMS, 2016, 40 (03) :1-7
[39]   A novel imbalanced data classification approach for suicidal ideation detection on social media [J].
Ben Hassine, Mohamed Ali ;
Abdellatif, Safa ;
Ben Yahia, Sadok .
COMPUTING, 2022, 104 (04) :741-765
[40]   SVM classification for imbalanced data sets using a multiobjective optimization framework [J].
Askan, Aysegul ;
Sayin, Serpil .
ANNALS OF OPERATIONS RESEARCH, 2014, 216 (01) :191-203