Spark-based ensemble learning for imbalanced data classification

Cited by: 0
Authors
Ding J. [1 ]
Wang S. [1 ]
Jia L. [1 ]
You J. [1 ]
Jiang Y. [1 ]
Affiliations
[1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming
Funding
National Natural Science Foundation of China;
Keywords
Comprehensive weight; Ensemble learning; Imbalanced data classification; Random forest; Spark;
DOI
10.23940/ijpe.18.05.p14.955964
Abstract
With the rapid expansion of Big Data across science and engineering domains, imbalanced data classification has become an increasingly acute problem in many real-world datasets. It is notably difficult to develop an efficient model by mechanically applying current data mining and machine learning algorithms. In this paper, we propose a Spark-based Ensemble Learning approach for imbalanced data classification (SELidc for short). The key points of SELidc lie in preprocessing to balance the imbalanced datasets, and in building a distributed ensemble learning algorithm to improve performance and reduce overfitting on big, imbalanced data. SELidc first converts the original imbalanced dataset into resilient distributed datasets (RDDs). Next, in the sampling process, it samples by a comprehensive weight, which is obtained from the weight of each class within the majority data and the number of minority class samples. After that, it trains several classifiers with random forest in the Spark environment, using correlation-based feature selection. Experiments on publicly available UCI datasets and other datasets demonstrate that SELidc achieves better results than related approaches across various evaluation metrics, and that it makes full use of the efficient computing power of the Spark distributed platform when training on massive data. © 2018 Totem Publisher, Inc. All rights reserved.
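The abstract's comprehensive-weight sampling step can be illustrated with a minimal sketch. The paper's exact formula is not given in this record, so the weighting below — down-sampling each majority class in proportion to its share of the majority data, scaled to the minority class size — is an assumed interpretation for illustration only, written in plain Python rather than on Spark RDDs:

```python
import random
from collections import Counter

def comprehensive_weight_sample(samples, labels, minority_label, seed=0):
    """Under-sample majority classes using a hypothetical 'comprehensive weight'.

    Assumed interpretation: each majority class contributes a number of
    samples proportional to its share of the majority data (its weight),
    scaled so the total majority draw matches the minority class size.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    n_minority = counts[minority_label]
    majority_total = sum(c for lbl, c in counts.items() if lbl != minority_label)

    # Keep every minority sample.
    balanced = [(x, y) for x, y in zip(samples, labels) if y == minority_label]

    for lbl, c in counts.items():
        if lbl == minority_label:
            continue
        # Comprehensive weight: the class's share of the majority data,
        # combined with the minority count to fix the number of draws.
        weight = c / majority_total
        n_draw = max(1, round(weight * n_minority))
        pool = [(x, y) for x, y in zip(samples, labels) if y == lbl]
        balanced.extend(rng.sample(pool, min(n_draw, len(pool))))

    rng.shuffle(balanced)
    return balanced
```

On a binary dataset this reduces to classic under-sampling (the single majority class is drawn down to the minority size); with several majority classes, larger classes keep proportionally more of their samples. In the paper's setting, the resulting balanced subsets would then feed the random forest learners distributed over Spark.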
Pages: 945 - 964
Number of pages: 19