Experimental evaluation of ensemble classifiers for imbalance in Big Data

Cited by: 18
Authors
Juez-Gil M. [1]
Arnaiz-González Á. [1]
Rodríguez J.J. [1]
García-Osorio C. [1]
Affiliations
[1] Escuela Politécnica Superior, University of Burgos, Burgos
Keywords
Big Data; Ensemble; Imbalance; Resampling; Spark; Unbalance
DOI
10.1016/j.asoc.2021.107447
Abstract
Datasets are growing in size and complexity at an unprecedented pace, giving rise to what is known as Big Data. A common problem in classification, especially with Big Data, is that the examples of the different classes may not be balanced. Imbalanced classification was introduced some decades ago to correct the tendency of classifiers to favor the majority class and ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets and not on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experiments were run on Spark clusters, comparing ensemble performance and execution times by means of statistical tests, including the newest ones based on the Bayesian approach. One notable conclusion of the study was that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appears necessary to process and reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data. © 2021 The Author(s)