OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets

Cited by: 37
Authors
Garcia-Pedrajas, Nicolas [1]
Perez-Rodriguez, Javier [1]
de Haro-Garcia, Aida [1]
Affiliations
[1] Univ Cordoba, Dept Comp & Numer Anal, E-14071 Cordoba, Spain
Keywords
Class-imbalance problem; instance selection; instance-based learning; very large problems; CLASSIFIERS; ALGORITHMS; REDUCTION; ENSEMBLES; RULE;
DOI
10.1109/TSMCB.2012.2206381
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812
Abstract
In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
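The abstract's central idea, applying a selection step to small balanced subsets and merging the results, can be illustrated with a minimal Python sketch. This is not the authors' OligoIS implementation: the abstract does not say how subsets are formed or how their results are combined, so the balancing scheme, the plain union of results, the `subset_size` parameter, and the function name `select_by_balanced_subsets` are all assumptions made for illustration. Hart's condensed nearest neighbour stands in for whatever instance selection method would actually be plugged in.

```python
import random
from typing import Callable, List, Sequence, Tuple

Instance = Tuple[Sequence[float], int]  # (feature vector, class label)
Selector = Callable[[List[Instance], List[int]], List[int]]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def condensed_nn(data: List[Instance], indices: List[int]) -> List[int]:
    """Stand-in selector (Hart's condensed nearest neighbour) on one subset.

    Keeps adding instances that the current kept set misclassifies with
    1-NN, until a full pass adds nothing.
    """
    kept = [indices[0]]
    changed = True
    while changed:
        changed = False
        for i in indices:
            if i in kept:
                continue
            nearest = min(kept, key=lambda j: sq_dist(data[i][0], data[j][0]))
            if data[nearest][1] != data[i][1]:
                kept.append(i)
                changed = True
    return kept


def select_by_balanced_subsets(
    data: List[Instance],
    minority_label: int,
    subset_size: int = 100,          # hypothetical parameter, not from the paper
    selector: Selector = condensed_nn,
    seed: int = 0,
) -> List[int]:
    """Return sorted indices of instances kept by per-subset selection.

    Assumed scheme: the majority class is split into disjoint chunks, each
    chunk is paired with an equally sized random draw of minority instances,
    and the selector runs on each balanced subset independently.
    """
    rng = random.Random(seed)
    minority = [i for i, (_, y) in enumerate(data) if y == minority_label]
    majority = [i for i, (_, y) in enumerate(data) if y != minority_label]
    if not minority or not majority:
        return list(range(len(data)))  # nothing to balance

    half = max(1, subset_size // 2)
    rng.shuffle(majority)
    selected = set()
    for start in range(0, len(majority), half):
        chunk = majority[start:start + half]
        partners = rng.sample(minority, min(len(chunk), len(minority)))
        # Each call sees at most subset_size instances, so its cost is bounded.
        selected.update(selector(data, partners + chunk))
    return sorted(selected)
```

With `subset_size` fixed, every selector call works on a bounded number of instances and the number of calls grows linearly with the data set, which is the intuition behind the linear-time claim. Because the subsets are processed independently, the loop could also be distributed across workers or fed from disk one chunk at a time.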
Pages: 332-346
Page count: 15
Related papers
50 in total
  • [21] A Study of Prototype Selection Algorithms for Nearest Neighbour in Class-Imbalanced Problems
    Valero-Mas, Jose J.
    Calvo-Zaragoza, Jorge
    Rico-Juan, Juan R.
    Inesta, Jose M.
    PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2017), 2017, 10255 : 335 - 343
  • [22] A classification method for class-imbalanced data and its application on bioinformatics
    Zou, Quan
    Guo, Maozu
    Liu, Yang
    Wang, Jun
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2010, 47 (08): : 1407 - 1414
  • [23] Comparison of Two Frameworks for Measuring the Stability of Gene-Selection Techniques on Noisy Class-Imbalanced Data
    Wald, Randall
    Khoshgoftaar, Taghi M.
    Abu Shanab, Ahmad
    2013 IEEE 25TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI), 2013, : 881 - 888
  • [24] Undersampling Instance Selection for Hybrid and Incomplete Imbalanced Data
    Camacho-Nieto, Oscar
    Yanez-Marquez, Cornelio
    Villuendas-Rey, Yenny
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2020, 26 (06) : 698 - 719
  • [25] A Hybrid Framework for Class-Imbalanced Classification
    Chen, Rui
    Luo, Lailong
    Chen, Yingwen
    Xia, Junxu
    Guo, Deke
    WIRELESS ALGORITHMS, SYSTEMS, AND APPLICATIONS, WASA 2021, PT I, 2021, 12937 : 301 - 313
  • [26] Dynamic financial distress prediction based on class-imbalanced data batches
    Sun, Jie
    Liu, Xin
    Ai, Wenguo
    Tian, Qianyuan
    INTERNATIONAL JOURNAL OF FINANCIAL ENGINEERING, 2021, 8 (03)
  • [27] Kernel Matrix Approximation on Class-Imbalanced Data With an Application to Scientific Simulation
    Hajibabaee, Parisa
    Pourkamali-Anaraki, Farhad
    Hariri-Ardebili, Mohammad Amin
    IEEE ACCESS, 2021, 9 : 83579 - 83591
  • [28] A deep multimodal generative and fusion framework for class-imbalanced multimodal data
    Li, Qing
    Yu, Guanyuan
    Wang, Jun
    Liu, Yuehao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2020, 79 (33-34) : 25023 - 25050
  • [29] GANs for Class-Imbalanced Data: A Meta-Analysis of GitHub Projects
    Sauber-Cole, Rick
    Khoshgoftaar, Taghi M.
    Johnson, Justin M.
    2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 1419 - 1424
  • [30] Evaluation of SMOTE for high-dimensional class-imbalanced microarray data
    Blagus, Rok
    Lusa, Lara
    2012 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2012), VOL 2, 2012, : 89 - 94