Experimental evaluation of ensemble classifiers for imbalance in Big Data

Cited by: 18

Authors:

Juez-Gil M. [1]

Arnaiz-González Á. [1]

Rodríguez J.J. [1]

García-Osorio C. [1]

Affiliation:

[1] Escuela Politécnica Superior, University of Burgos, Burgos

Source:

Applied Soft Computing | 2021 / Vol. 108

Keywords:

Big Data; Ensemble; Imbalance; Resampling; Spark; Unbalance

DOI:

10.1016/j.asoc.2021.107447

Abstract:

Datasets are growing in size and complexity at a pace never seen before, forming ever larger datasets known as Big Data. A common problem for classification, especially in Big Data, is that the numerous examples of the different classes might not be balanced. Some decades ago, imbalanced classification was therefore introduced to correct the tendency of classifiers to favor the majority class and ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets and not on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experimentation was launched in Spark clusters, comparing ensemble performance and execution times with statistical test results, including the newest ones based on the Bayesian approach. One very interesting conclusion from the study was that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appears necessary to process and reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data. © 2021 The Author(s)
