OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets

Cited by: 37

Authors:

Garcia-Pedrajas, Nicolas [1]

Perez-Rodriguez, Javier [1]

de Haro-Garcia, Aida [1]

Affiliations:

[1] Univ Cordoba, Dept Comp & Numer Anal, E-14071 Cordoba, Spain

Source:

IEEE TRANSACTIONS ON CYBERNETICS | 2013 / Vol. 43 / No. 1

Keywords:

Class-imbalance problem; instance selection; instance-based learning; very large problems; CLASSIFIERS; ALGORITHMS; REDUCTION; ENSEMBLES; RULE;

DOI:

10.1109/TSMCB.2012.2206381

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
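The abstract gives only the outline of the method: divide the data, then apply selection to balanced subsets. The following is a minimal sketch of that outline, not the authors' algorithm. The stratified round-robin partition, the undersampling rule, the union of the kept instances, and the names `balanced_subset_selection` and `keep_all` are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def keep_all(indices):
    """Hypothetical placeholder selector: keeps every instance it is given.

    A real instance selection algorithm would go here.
    """
    return list(indices)


def balanced_subset_selection(labels, n_subsets, selector=keep_all, seed=0):
    """Return the sorted indices kept after selection on balanced subsets.

    labels: list of class labels, where index i identifies instance i.
    n_subsets: number of disjoint subsets to divide the data into.
    selector: callable that takes a list of indices and returns those kept.
    """
    rng = random.Random(seed)

    # Group instance indices by class.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Divide: deal each class's shuffled indices round-robin into disjoint
    # subsets (assumption: a stratified partition).
    subsets = [defaultdict(list) for _ in range(n_subsets)]
    for y, members in by_class.items():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            subsets[pos % n_subsets][y].append(i)

    kept = set()
    for subset in subsets:
        if not subset:
            continue
        # Balance: undersample each class to the size of the rarest class
        # present in this subset (assumption of this sketch).
        n = min(len(members) for members in subset.values())
        balanced = [i for members in subset.values()
                    for i in rng.sample(members, n)]
        # Conquer: run the selector on the balanced subset only.
        kept.update(selector(balanced))
    return sorted(kept)


# Toy data: 10 minority instances (class 1) and 90 majority instances (class 0).
labels = [1] * 10 + [0] * 90
kept = balanced_subset_selection(labels, n_subsets=5)
print(Counter(labels[i] for i in kept))
```

Each instance is touched a constant number of times and every subset is processed independently, so the sketch runs in time linear in the number of instances (given a selector whose cost is bounded per fixed-size subset) and the per-subset loop could be parallelized.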


Pages: 332-346

Number of pages: 15
