Experimental evaluation of ensemble classifiers for imbalance in Big Data

Cited by: 18

Authors:

Juez-Gil M. ^{[1]}

Arnaiz-González Á. ^{[1]}

Rodríguez J.J. ^{[1]}

García-Osorio C. ^{[1]}

Affiliation:

[1] Escuela Politécnica Superior, University of Burgos, Burgos

Source:

Applied Soft Computing | 2021 / Vol. 108

Keywords:

Big Data; Ensemble; Imbalance; Resampling; Spark; Unbalance

DOI:

10.1016/j.asoc.2021.107447

Abstract:

Datasets are growing in size and complexity at a pace never seen before, forming ever larger datasets known as Big Data. A common problem for classification, especially in Big Data, is that the numerous examples of the different classes might not be balanced. Some decades ago, imbalanced classification was therefore introduced to correct the tendency of classifiers to show bias in favor of the majority class and to ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets and not on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experimentation was launched in Spark clusters, comparing ensemble performance and execution times with statistical test results, including the newest ones based on the Bayesian approach. One very interesting conclusion from the study was that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appears necessary to process and to reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data. © 2021 The Author(s)

Citations

50 in total

[41] An Evaluation of Big Data Architectures
Garises, Valerie
Quenum, Jose G.
PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON DATA SCIENCE, TECHNOLOGY AND APPLICATIONS (DATA), 2019, : 152 - 159
[42] An evolutionary algorithm approach to optimal ensemble classifiers for DNA microarray data analysis
Kim, Kyung-Joong
Cho, Sung-Bae
IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, 2008, 12 (03) : 377 - 388
[43] Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem
Rendon, Erendira
Alejo, Roberto
Castorena, Carlos
Isidro-Ortega, Frank J.
Granda-Gutierrez, Everardo E.
APPLIED SCIENCES-BASEL, 2020, 10 (04):
[44] The classification of imbalanced large data sets based on MapReduce and ensemble of ELM classifiers
Zhai, Junhai
Zhang, Sufang
Wang, Chenxi
INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2017, 8 (03) : 1009 - 1017
[45] The IPTV Video Evaluation Model Based on Big Data
Yu, Longfeng
Gu, Junhua
Wang, Shoubin
Zhang, Suqi
2017 IEEE 2ND INTERNATIONAL CONFERENCE ON BIG DATA ANALYSIS (ICBDA), 2017, : 164 - 168
[46] Optimizing Ensemble Trees for Big Data Healthcare Fraud Detection
Hancock, John
Khoshgoftaar, Taghi M.
2022 IEEE 23RD INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE (IRI 2022), 2022, : 243 - 249
[47] Improving malware detection using big data and ensemble learning
Gupta, Deepak
Rani, Rinkle
COMPUTERS & ELECTRICAL ENGINEERING, 2020, 86
[48] Investigation on the use of ensemble learning and big data in crop identification
Ahmed, Sayed
Mahmoud, Amira S.
Farg, Eslam
Mohamed, Amany M.
Moustafa, Marwa S.
Abutaleb, Khaled
Saleh, Ahmed M.
AbdelRahman, Mohamed A. E.
AbdelSalam, Hisham M.
Arafat, Sayed M.
HELIYON, 2023, 9 (02)
[49] Intrusion detection based on ensemble learning for big data classification
Jemili, Farah
Meddeb, Rahma
Korbaa, Ouajdi
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2024, 27 (03) : 3771 - 3798
[50] Big SQL systems: an experimental evaluation
Victor Aluko
Sherif Sakr
Cluster Computing, 2019, 22 : 1347 - 1377
