Experimental evaluation of ensemble classifiers for imbalance in Big Data

Cited by: 18
Authors
Juez-Gil M. [1]
Arnaiz-González Á. [1]
Rodríguez J.J. [1]
García-Osorio C. [1]
Affiliations
[1] Escuela Politécnica Superior, University of Burgos, Burgos
Keywords
Big Data; Ensemble; Imbalance; Resampling; Spark; Unbalance
DOI
10.1016/j.asoc.2021.107447
Abstract
Datasets are growing in size and complexity at an unprecedented pace, giving rise to what is known as Big Data. A common problem for classification, especially in Big Data, is that the classes may not be balanced: some have far more examples than others. Imbalanced classification was introduced some decades ago to correct the tendency of classifiers to favor the majority class and ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets rather than on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experiments were run on Spark clusters, and ensemble performance and execution times were compared using statistical tests, including recent ones based on the Bayesian approach. One notable conclusion of the study was that, for unbalanced datasets in the context of Big Data, simpler methods provided better results than complex ones. The additional complexity of some of the sophisticated methods, which appears necessary to process and reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data. © 2021 The Author(s)