Spark-based ensemble learning for imbalanced data classification

Cited by: 0
Authors
Ding J. [1]
Wang S. [1]
Jia L. [1]
You J. [1]
Jiang Y. [1]
Affiliation
[1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming
Funding
National Natural Science Foundation of China;
Keywords
Comprehensive weight; Ensemble learning; Imbalanced data classification; Random forest; Spark;
DOI
10.23940/ijpe.18.05.p14.955964
Abstract
With the rapid expansion of Big Data across science and engineering domains, imbalanced data classification has become a more acute problem in many real-world datasets. It is notably difficult to build an efficient model by mechanically applying existing data mining and machine learning algorithms. In this paper, we propose a Spark-based Ensemble Learning approach for imbalanced data classification (SELidc for short). The key idea of SELidc is to balance the imbalanced dataset through preprocessing, and to improve performance and reduce overfitting on big, imbalanced data by building a distributed ensemble learning algorithm. SELidc first converts the original imbalanced dataset into resilient distributed datasets (RDDs). Next, in the sampling step, it samples by a comprehensive weight, which is obtained from the weight of each class within the majority class and the number of minority class samples. It then trains several random forest classifiers in the Spark environment by means of correlation feature selection. Experiments on publicly available UCI datasets and other datasets show that SELidc achieves better results than related approaches across various evaluation metrics, and that it makes full use of the computing power of the Spark distributed platform when training on massive data. © 2018 Totem Publisher, Inc. All rights reserved.
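The abstract does not give the formula for the comprehensive weight, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each majority class is weighted by its share of all majority samples, and that the sampling budget equals the number of minority samples, so the result is balanced. The function names `allocate_quota` and `balance` are hypothetical, and plain Python stands in for the per-partition work that Spark would distribute.

```python
import random
from collections import Counter


def allocate_quota(class_sizes, n_minority):
    """Split a budget of n_minority samples across majority classes.

    ASSUMPTION: each class is weighted by its share of the majority samples.
    Fractional shares are rounded with the largest-remainder rule.
    """
    total = sum(class_sizes.values())
    budget = min(n_minority, total)  # cannot keep more than exists
    raw = {k: budget * size / total for k, size in class_sizes.items()}
    quota = {k: int(r) for k, r in raw.items()}
    leftover = budget - sum(quota.values())
    # Give the remaining slots to the classes with the largest fractions.
    by_fraction = sorted(raw, key=lambda k: raw[k] - int(raw[k]), reverse=True)
    for k in by_fraction:
        if leftover <= 0:
            break
        if quota[k] < class_sizes[k]:
            quota[k] += 1
            leftover -= 1
    return quota


def balance(samples, minority_label, seed=0):
    """Undersample majority classes so they total the minority count.

    `samples` is a list of (features, label) pairs; every label other than
    `minority_label` is treated as a majority class.
    """
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == minority_label]
    majority = [s for s in samples if s[1] != minority_label]
    sizes = Counter(label for _, label in majority)
    quota = allocate_quota(sizes, len(minority))
    kept = []
    for label, n_keep in quota.items():
        members = [s for s in majority if s[1] == label]
        kept.extend(rng.sample(members, n_keep))  # sample without replacement
    return minority + kept
```

Under these assumptions, the balanced output would then be passed to the feature selection and random forest training steps; in a Spark setting the same quota logic could be applied to an RDD or DataFrame grouped by class.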
Pages: 945 - 964
Page count: 19