OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets

Cited by: 37
Authors
Garcia-Pedrajas, Nicolas [1]
Perez-Rodriguez, Javier [1]
de Haro-Garcia, Aida [1]
Affiliations
[1] Univ Cordoba, Dept Comp & Numer Anal, E-14071 Cordoba, Spain
Keywords
Class-imbalance problem; instance selection; instance-based learning; very large problems; CLASSIFIERS; ALGORITHMS; REDUCTION; ENSEMBLES; RULE
DOI
10.1109/TSMCB.2012.2206381
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
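The divide-and-conquer idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' OligoIS algorithm: the function names, the `per_class` parameter, the two-class assumption, and the use of Hart's condensed nearest neighbor as the per-subset selector are all assumptions made for illustration. The sketch splits the majority class into disjoint chunks, pairs each chunk with a same-sized minority sample to form a balanced subset, runs the selector on each subset, and returns the union of the selected instances.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def condensed_nn(instances):
    """Hart's condensed nearest neighbor, used here as a stand-in selector.

    Keeps adding instances that the current store misclassifies with 1-NN
    until a full pass adds nothing.
    """
    kept = [0]
    changed = True
    while changed:
        changed = False
        for i, (x, y) in enumerate(instances):
            if i in kept:
                continue
            nearest = min(kept, key=lambda k: sq_dist(instances[k][0], x))
            if instances[nearest][1] != y:
                kept.append(i)
                changed = True
    return [instances[i] for i in kept]


def balanced_divide_and_select(data, per_class=50, seed=0):
    """Apply a selector to small balanced subsets and merge the results.

    `data` is a list of (feature_tuple, label) pairs with two classes
    (an assumption of this sketch). Each subset holds at most `per_class`
    instances per class, so the cost of one selector call is bounded and
    the number of calls grows linearly with the majority class size.
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in data)
    minority_label = min(counts, key=counts.get)
    minority = [d for d in data if d[1] == minority_label]
    majority = [d for d in data if d[1] != minority_label]
    rng.shuffle(majority)

    selected = set()
    for start in range(0, len(majority), per_class):
        chunk = majority[start:start + per_class]
        sample = rng.sample(minority, min(per_class, len(minority)))
        subset = chunk + sample
        rng.shuffle(subset)
        selected.update(condensed_nn(subset))
    return sorted(selected)
```

Because every subset is processed independently, the loop body could be dispatched to parallel workers, and each subset could be read from disk on demand instead of holding the whole data set in memory. Both points follow the abstract's claims but are not implemented in this sketch.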
Pages: 332-346
Number of pages: 15