Experimental evaluation of ensemble classifiers for imbalance in Big Data

Cited by: 18

Authors:

Juez-Gil M. [1]

Arnaiz-González Á. [1]

Rodríguez J.J. [1]

García-Osorio C. [1]

Affiliations:

[1] Escuela Politécnica Superior, University of Burgos, Burgos

Source:

Applied Soft Computing | 2021 / Volume 108

Keywords:

Big Data; Ensemble; Imbalance; Resampling; Spark; Unbalance

DOI:

10.1016/j.asoc.2021.107447

Abstract:

Datasets are growing in size and complexity at an unprecedented pace, giving rise to the ever larger collections known as Big Data. A common problem for classification, especially in Big Data, is that the examples of the different classes might not be balanced. Imbalanced classification was therefore introduced some decades ago to correct the tendency of classifiers to favor the majority class and ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets and not on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experiments were run on Spark clusters, comparing ensemble performance and execution times by means of statistical tests, including the newest ones based on the Bayesian approach. One notable conclusion of the study was that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appears necessary to process and reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data. © 2021 The Author(s)
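To make the idea of combining resampling with a Bagging-style ensemble concrete, the following is a minimal single-machine sketch in plain Python. It is not the authors' implementation and does not use Spark. The abstract does not name the specific resampling methods evaluated, so random undersampling is chosen here purely as an illustration. All function names (`random_undersample`, `fit_stump`, `fit_balanced_ensemble`, `predict`) are hypothetical, and the decision-stump base learner and binary 0/1 labels are assumptions made to keep the example short.

```python
import random
from collections import Counter


def random_undersample(X, y, rng):
    """Randomly drop examples so that every class keeps as many
    examples as the smallest class has."""
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    n_min = min(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for features in rng.sample(rows, n_min):
            Xb.append(features)
            yb.append(label)
    return Xb, yb


def fit_stump(X, y):
    """Illustrative base learner (assumption): the single-feature
    threshold rule with the fewest training errors, labels in {0, 1}."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for above in (0, 1):
                preds = [above if row[j] > t else 1 - above for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, above)
    _, j, t, above = best
    return lambda row: above if row[j] > t else 1 - above


def fit_balanced_ensemble(X, y, n_estimators=11, seed=0):
    """Train each ensemble member on its own balanced random subsample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        Xb, yb = random_undersample(X, y, rng)
        models.append(fit_stump(Xb, yb))
    return models


def predict(models, row):
    """Majority vote over the ensemble members."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]


# Toy imbalanced data: 95 majority examples (label 0), 5 minority (label 1).
X = [[i / 100] for i in range(95)] + [[2.0 + i / 10] for i in range(5)]
y = [0] * 95 + [1] * 5
models = fit_balanced_ensemble(X, y)
```

Because each member sees a different balanced subsample, the majority class is still covered across the ensemble even though every individual member discards most of it.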

