OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets

Cited by: 37
Authors
Garcia-Pedrajas, Nicolas [1]
Perez-Rodriguez, Javier [1]
de Haro-Garcia, Aida [1]
Affiliations
[1] Univ Cordoba, Dept Comp & Numer Anal, E-14071 Cordoba, Spain
Keywords
Class-imbalance problem; instance selection; instance-based learning; very large problems; CLASSIFIERS; ALGORITHMS; REDUCTION; ENSEMBLES; RULE;
DOI
10.1109/TSMCB.2012.2206381
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
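The abstract describes the approach only at a high level: split the data divide-and-conquer style, then run instance selection on balanced subsets. The following is a minimal Python sketch of that idea under stated assumptions, not the authors' implementation. The function names, the vote threshold, the number of rounds, the random undersampling used to balance each subset, and the use of Hart's condensed nearest neighbor as the selector are all illustrative choices that the abstract does not specify.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def condensed_nn(idx, X, y):
    """Stand-in selector (Hart's condensed nearest neighbor).

    Assumption: any instance selection method could be plugged in here;
    this one is NOT taken from the paper.
    """
    store = [idx[0]]
    changed = True
    while changed:
        changed = False
        for i in idx:
            if i in store:
                continue
            nearest = min(store, key=lambda j: sq_dist(X[i], X[j]))
            if y[nearest] != y[i]:  # misclassified by the store, so keep it
                store.append(i)
                changed = True
    return store


def select_by_balanced_subsets(X, y, subset_size=50, rounds=5,
                               threshold=1, seed=0):
    """Hypothetical divide-and-conquer selection over balanced subsets."""
    rng = random.Random(seed)
    votes = Counter()
    order = list(range(len(X)))
    for _ in range(rounds):
        rng.shuffle(order)
        # Divide: disjoint chunks of bounded size.
        for start in range(0, len(order), subset_size):
            by_class = {}
            for i in order[start:start + subset_size]:
                by_class.setdefault(y[i], []).append(i)
            if len(by_class) < 2:
                continue  # single-class chunk: nothing to balance against
            # Balance: undersample every class to the rarest class's count.
            n = min(len(members) for members in by_class.values())
            balanced = [i for members in by_class.values()
                        for i in rng.sample(members, n)]
            # Conquer: run the selector on the small balanced subset.
            for i in condensed_nn(balanced, X, y):
                votes[i] += 1
    # Combine: keep instances selected at least `threshold` times.
    return sorted(i for i, v in votes.items() if v >= threshold)
```

Because each chunk has at most `subset_size` instances, the selector's cost per chunk is bounded, and the number of chunks grows linearly with the data set size, which is consistent with the linear-time claim in the abstract. The chunks are independent of one another, so in principle they could be processed in parallel or streamed from disk one at a time.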
Pages: 332-346
Number of pages: 15