Experimental evaluation of ensemble classifiers for imbalance in Big Data

Cited by: 18
Authors
Juez-Gil M. [1]
Arnaiz-González Á. [1]
Rodríguez J.J. [1]
García-Osorio C. [1]
Affiliations
[1] Escuela Politécnica Superior, University of Burgos, Burgos
Keywords
Big Data; Ensemble; Imbalance; Resampling; Spark; Unbalance
DOI
10.1016/j.asoc.2021.107447
Abstract
Datasets are growing in size and complexity at a pace never seen before, forming ever larger datasets known as Big Data. A common problem for classification, especially in Big Data, is that the numerous examples of the different classes might not be balanced. Some decades ago, imbalanced classification was therefore introduced to correct the tendency of classifiers to show bias in favor of the majority class and to ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets and not on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experimentation was launched in Spark clusters, comparing ensemble performance and execution times with statistical test results, including the newest ones based on the Bayesian approach. One very interesting conclusion from the study was that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appears necessary to process and reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data. © 2021 The Author(s)
Related papers
50 in total
  • [31] A Classifier Ensemble Framework for Multimedia Big Data Classification
    Yan, Yilin
    Zhu, Qiusha
    Shyu, Mei-Ling
    Chen, Shu-Ching
    PROCEEDINGS OF 2016 IEEE 17TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IEEE IRI), 2016, : 615 - 622
  • [32] Empirical Analysis of Asymptotic Ensemble Learning for Big Data
    Salloum, Salman
    Huang, Joshua Zhexue
    He, Yulin
    2016 3RD IEEE/ACM INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING, APPLICATIONS AND TECHNOLOGIES (BDCAT), 2016, : 8 - 17
  • [33] Output Thresholding for Ensemble Learners and Imbalanced Big Data
    Johnson, Justin M.
    Khoshgoftaar, Taghi M.
    2021 IEEE 33RD INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2021), 2021, : 1449 - 1454
  • [34] Big Ensemble Data Assimilation in Numerical Weather Prediction
    Miyoshi, Takemasa
    Kondo, Keiichi
    Terasaki, Koji
    COMPUTER, 2015, 48 (11) : 15 - 21
  • [35] An Ensemble approach to Big Data Security (Cyber Security)
    Hashmani, Manzoor Ahmed
    Jameel, Syed Muslim
    Ibrahim, Aidarus M.
    Zaffar, Maryam
    Raza, Kamran
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2018, 9 (09) : 75 - 77
  • [36] An Experimental Study of A Biosequence Big Data Analysis Service
    Zhou, Wei
    Liu, Ling
    Pu, Calton
    Zhu, Tao
    Wang, Qingyang
    Xiang, Wenkun
    Yao, Shaowen
    2017 IEEE 24TH INTERNATIONAL CONFERENCE ON WEB SERVICES (ICWS 2017), 2017, : 237 - 244
  • [37] Ensemble framework for concept drift detection and class imbalance in data streams
    S P.
    R A.U.
    Multimedia Tools and Applications, 2025, 84 (11) : 8823 - 8837
  • [38] Multi-step forecasting for big data time series based on ensemble learning
    Galicia, A.
    Talavera-Llames, R.
    Troncoso, A.
    Koprinska, I.
    Martinez-Alvarez, F.
    KNOWLEDGE-BASED SYSTEMS, 2019, 163 : 830 - 841
  • [39] Evaluation and the Big Data Challenge
    Picciotto, Robert
    AMERICAN JOURNAL OF EVALUATION, 2020, 41 (02) : 166 - 181
  • [40] Ensemble classifiers using multi-objective Genetic Programming for unbalanced data
    Meng, Wenyang
    Li, Ying
    Gao, Xiaoying
    Ma, Jianbin
    APPLIED SOFT COMPUTING, 2024, 158