OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets

Cited by: 37
Authors
Garcia-Pedrajas, Nicolas [1]
Perez-Rodriguez, Javier [1]
de Haro-Garcia, Aida [1]
Affiliations
[1] Univ Cordoba, Dept Comp & Numer Anal, E-14071 Cordoba, Spain
Keywords
Class-imbalance problem; instance selection; instance-based learning; very large problems; CLASSIFIERS; ALGORITHMS; REDUCTION; ENSEMBLES; RULE;
DOI
10.1109/TSMCB.2012.2206381
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
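The divide-and-conquer idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the parameters (`subset_size`, `rounds`, `min_votes`), the vote-based combination step, and the use of Hart's condensed nearest neighbour as the per-subset selector are all assumptions made for illustration. The sketch only shows why applying a selector to small, balanced subsets keeps the total cost roughly linear in the number of instances: each subset has bounded size, and the number of subsets grows linearly with the data.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def condensed_nn(points):
    """Stand-in selector (Hart's condensed nearest neighbour).

    `points` is a list of (index, features, label). An instance is kept
    only if the instances kept so far misclassify it with 1-NN.
    """
    store = [points[0]]
    kept = {points[0][0]}
    changed = True
    while changed:
        changed = False
        for idx, x, label in points:
            if idx in kept:
                continue
            nearest = min(store, key=lambda s: sq_dist(s[1], x))
            if nearest[2] != label:
                store.append((idx, x, label))
                kept.add(idx)
                changed = True
    return kept


def balanced_divide_and_select(X, y, minority_label, subset_size,
                               rounds, min_votes, seed=0):
    """Hypothetical sketch: run a selector on small balanced subsets.

    The majority class is split into disjoint chunks; each chunk is paired
    with an equally sized random sample of the minority class (assumption),
    and instances are kept if selected in at least `min_votes` subsets.
    """
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(y) if lab == minority_label]
    majority = [i for i, lab in enumerate(y) if lab != minority_label]
    votes = [0] * len(y)
    half = max(1, subset_size // 2)
    for _ in range(rounds):
        rng.shuffle(majority)
        for start in range(0, len(majority), half):
            maj_chunk = majority[start:start + half]
            min_chunk = rng.sample(minority, min(len(minority), len(maj_chunk)))
            subset = [(i, X[i], y[i]) for i in maj_chunk + min_chunk]
            rng.shuffle(subset)
            # Each call sees at most ~subset_size instances, so its cost is bounded.
            for i in condensed_nn(subset):
                votes[i] += 1
    return [i for i, v in enumerate(votes) if v >= min_votes]
```

Because every subset is processed independently, the inner loop could in principle be distributed across workers or fed from disk one chunk at a time, which is the property the abstract highlights; the sketch above keeps everything in memory for brevity.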
Pages: 332-346
Page count: 15
Related papers
50 in total
  • [1] Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines
    Maldonado, Sebastian
    Weber, Richard
    Famili, Fazel
    INFORMATION SCIENCES, 2014, 286 : 228 - 246
  • [2] A Scalable Exemplar-Based Subspace Clustering Algorithm for Class-Imbalanced Data
    You, Chong
    Li, Chi
    Robinson, Daniel P.
    Vidal, Rene
    COMPUTER VISION - ECCV 2018, PT IX, 2018, 11213 : 68 - 85
  • [3] Online feature selection for high-dimensional class-imbalanced data
    Zhou, Peng
    Hu, Xuegang
    Li, Peipei
    Wu, Xindong
    KNOWLEDGE-BASED SYSTEMS, 2017, 136 : 187 - 199
  • [4] Feature selection and classification by minimizing overlap degree for class-imbalanced data in metabolomics
    Fu, Guang-Hui
    Wu, Yuan-Jiao
    Zong, Min-Jie
    Yi, Lun-Zhao
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2020, 196
  • [5] Stable variable selection of class-imbalanced data with precision-recall criterion
    Fu, Guang-Hui
    Xu, Feng
    Zhang, Bing-Yang
    Yi, Lun-Zhao
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2017, 171 : 241 - 250
  • [6] Exploring of clustering algorithm on class-imbalanced data
    Li Xuan
    Chen Zhigang
    Yang Fan
    PROCEEDINGS OF THE 2013 8TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION (ICCSE 2013), 2013, : 89 - 93
  • [7] Class prediction for high-dimensional class-imbalanced data
    Blagus, Rok
    Lusa, Lara
    BMC BIOINFORMATICS, 2010, 11 : 523
  • [8] Research On Classification Method Of High-Dimensional Class-Imbalanced Data Sets Based On SVM
    Zhang, Chunkai
    Guo, Jianwei
    Lu, Junru
    2017 IEEE SECOND INTERNATIONAL CONFERENCE ON DATA SCIENCE IN CYBERSPACE (DSC), 2017, : 60 - 67
  • [9] Class prediction for high-dimensional class-imbalanced data
    Rok Blagus
    Lara Lusa
    BMC Bioinformatics, 11
  • [10] A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty
    Yu, Sihao
    Guo, Jiafeng
    Zhang, Ruqing
    Fan, Yixing
    Wang, Zizhen
    Cheng, Xueqi
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 70 - 79