Global-local information based oversampling for multi-class imbalanced data

Cited by: 11
Authors
Han, Mingming [1 ]
Guo, Husheng [1 ]
Li, Jinyan [3 ,4 ]
Wang, Wenjian [1 ,2 ]
Affiliations
[1] Shanxi Univ, Sch Comp & Informat Technol, Taiyuan 030006, Shanxi, Peoples R China
[2] Shanxi Univ, Key Lab Computat Intelligence & Chinese Informat P, Minist Educ, Taiyuan 030006, Shanxi, Peoples R China
[3] Univ Technol Sydney, Adv Analyt Inst, Fac Engn, Broadway, NSW, Australia
[4] Univ Technol Sydney, IT, Broadway, NSW, Australia
Funding
National Natural Science Foundation of China;
Keywords
Oversampling; Intrinsic characteristics; Synthetic strategy; Over-sampling technique; Data sets; SMOTE; Classification; Ensemble
DOI
10.1007/s13042-022-01746-w
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Multi-class imbalanced classification is a challenging problem in machine learning. Among the many methods proposed to address it, oversampling is one of the most popular: it alleviates class imbalance by generating synthetic instances for the minority classes. However, most oversampling methods apply a single generation strategy to all candidate minority instances, which neglects the intrinsic characteristics that differ across minority-class instances and makes the synthetic instances redundant or ineffective. In this work, we propose a global-local information based oversampling method, termed GLOS. We introduce a new discreteness-based metric (DID) and distinguish minority classes from majority classes by comparing the dataset-level DID value with each class-level discreteness value. Then, for each minority class, GLOS selects the difficult-to-learn instances, namely those whose instance-level dispersion is smaller than the corresponding class-level value, as seeds for generating synthetic instances, and the number of synthetic instances is set by the difference between the two dispersion values. The selected instances are assigned to different groups according to their local distribution, and GLOS applies a synthetic strategy tailored to each group. Finally, all minority-class instances, part of the majority-class instances, and the synthetic data are used as training data. In this way, both the quantity and the quality of the synthetic instances are guaranteed. Experimental results on KEEL and UCI data sets demonstrate the effectiveness of the proposed method.
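The abstract only outlines the GLOS pipeline, so the Python sketch below is an illustration rather than a reproduction of the method. The mean-distance-to-centroid dispersion measure, the count-based minority test, the balance-to-largest-class quota, and the single SMOTE-style interpolation rule are all stand-in assumptions for the paper's DID metric, its global-versus-class-level comparison, its dispersion-difference instance budget, and its group-specific synthetic strategies.

import numpy as np

rng = np.random.default_rng(0)

def dispersion(points):
    # Stand-in "discreteness" value: mean Euclidean distance to the class
    # centroid. The paper's actual DID metric is not given in the abstract.
    centroid = points.mean(axis=0)
    return float(np.linalg.norm(points - centroid, axis=1).mean())

def oversample(X, y):
    # Assumption: minority classes are detected by instance count and
    # oversampled up to the size of the largest class; the paper instead
    # compares dataset-level and class-level DID values and derives the
    # instance budget from their difference.
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    new_X, new_y = [X], [y]
    for cls, count in zip(classes, counts):
        if count >= target:
            continue
        Xc = X[y == cls]
        class_disp = dispersion(Xc)
        # Instance-level dispersion: each point's distance to the class centroid.
        inst_disp = np.linalg.norm(Xc - Xc.mean(axis=0), axis=1)
        # Selection rule from the abstract: the seeds are the "difficult-to-learn"
        # instances whose instance-level dispersion is below the class-level value.
        seeds = Xc[inst_disp < class_disp]
        if len(seeds) < 2:
            seeds = Xc  # fall back to the whole class if too few seeds
        n_new = target - count
        # SMOTE-style interpolation between random seed pairs stands in for
        # the paper's group-specific synthetic strategies.
        i = rng.integers(0, len(seeds), n_new)
        j = rng.integers(0, len(seeds), n_new)
        lam = rng.random((n_new, 1))
        new_X.append(seeds[i] + lam * (seeds[j] - seeds[i]))
        new_y.append(np.full(n_new, cls))
    return np.vstack(new_X), np.concatenate(new_y)

On a toy two-class set, X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (10, 2))]) with y = np.array([0] * 100 + [1] * 10), oversample(X, y) returns 100 instances per class, with the new minority points interpolated among the seeds closest to the class centroid.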
Pages: 2071-2086
Page count: 16