Spark-based ensemble learning for imbalanced data classification

Cited by: 0
Authors
Ding J. [1]
Wang S. [1]
Jia L. [1]
You J. [1]
Jiang Y. [1]
Affiliations
[1] Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming
Funding
National Natural Science Foundation of China
Keywords
Comprehensive weight; Ensemble learning; Imbalanced data classification; Random forest; Spark
DOI
10.23940/ijpe.18.05.p14.955964
Abstract
With the rapid expansion of Big Data across science and engineering domains, imbalanced data classification has become a more acute problem in many real-world datasets. It is notably difficult to build an efficient model by mechanically applying current data mining and machine learning algorithms. In this paper, we propose a Spark-based Ensemble Learning approach for imbalanced data classification (SELidc for short). The key idea of SELidc is to balance the imbalanced dataset through preprocessing, and to improve performance and reduce overfitting on large, imbalanced data by building a distributed ensemble learning algorithm. SELidc first converts the original imbalanced dataset into resilient distributed datasets (RDDs). Next, in the sampling step, it samples by a comprehensive weight, which is derived from the weight of each class among the majority classes and the number of minority-class samples. It then trains several random forest classifiers in the Spark environment, using correlation-based feature selection. Experiments on publicly available UCI datasets and other datasets show that SELidc achieves better results than related approaches across various evaluation metrics, and that it makes full use of the computing power of the Spark distributed platform when training on massive data. © 2018 Totem Publisher, Inc. All rights reserved.
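The sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the comprehensive-weight formula, so the quota rule below is an assumption made for illustration only: each majority class contributes a share of the minority-class count proportional to its size among the majority samples. The function name `balanced_subsets` and its parameters are hypothetical, and plain Python lists stand in for Spark RDDs.

```python
import random
from collections import defaultdict


def balanced_subsets(samples, minority_label, n_subsets=3, seed=0):
    """Build n_subsets roughly balanced training sets from (features, label) pairs.

    ASSUMPTION: the quota for majority class c is
    round(n_minority * n_c / n_majority). This is a stand-in, not the
    paper's actual comprehensive-weight formula.
    """
    rng = random.Random(seed)

    # Group the samples by class label.
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))

    minority = by_label[minority_label]
    majority = {k: v for k, v in by_label.items() if k != minority_label}
    n_majority = sum(len(rows) for rows in majority.values())

    subsets = []
    for _ in range(n_subsets):
        # Every subset keeps all minority samples.
        subset = list(minority)
        for rows in majority.values():
            # Undersample each majority class down to its quota.
            quota = round(len(minority) * len(rows) / n_majority)
            quota = max(1, min(quota, len(rows)))
            subset.extend(rng.sample(rows, quota))
        subsets.append(subset)
    return subsets


# Toy data: two majority classes ("a", "b") and one minority class ("rare").
data = (
    [((i,), "a") for i in range(60)]
    + [((i,), "b") for i in range(30)]
    + [((i,), "rare") for i in range(10)]
)
subsets = balanced_subsets(data, minority_label="rare")
```

In an ensemble of this kind, each subset would train one base classifier (a random forest in the paper) and the predictions would be combined, for example by voting. On Spark, the per-class undersampling could presumably be expressed with a stratified sampling call such as `DataFrame.sampleBy`, though the abstract does not say which API the authors used.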
Pages: 945 - 964
Page count: 19
Related papers
50 in total
  • [31] Ensemble learning based predictive modelling on a highly imbalanced multiclass data
    Vasti, Manka
    Dev, Amita
    JOURNAL OF INFORMATION & OPTIMIZATION SCIENCES, 2024, 45 (08) : 2141 - 2164
  • [32] Adaptive ensemble of classifiers with regularization for imbalanced data classification
    Wang, Chen
    Deng, Chengyuan
    Yu, Zhoulu
    Hui, Dafeng
    Gong, Xiaofeng
    Luo, Ruisen
    INFORMATION FUSION, 2021, 69 : 81 - 102
  • [33] The classification of imbalanced large data sets based on MapReduce and ensemble of ELM classifiers
    Zhai, Junhai
    Zhang, Sufang
    Wang, Chenxi
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2017, 8 (03) : 1009 - 1017
  • [34] IMBALANCED DATA CLASSIFICATION BASED ON EXTREME LEARNING MACHINE AUTOENCODER
    Shen, Chu
    Zhang, Su-Fang
    Zhai, Jun-Hai
    Luo, Ding-Sheng
    Chen, Jun-Fen
    PROCEEDINGS OF 2018 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS (ICMLC), VOL 2, 2018, : 399 - 404
  • [35] A Spark-based parallel framework for geospatial raster data processing
    Gao, Fan
    Yue, Peng
    2018 7TH INTERNATIONAL CONFERENCE ON AGRO-GEOINFORMATICS (AGRO-GEOINFORMATICS), 2018, : 53 - 56
  • [36] Rarity updated ensemble with oversampling: An ensemble approach to classification of imbalanced data streams
    Nouri, Zahra
    Kiani, Vahid
    Fadishei, Hamid
    STATISTICAL ANALYSIS AND DATA MINING, 2024, 17 (01)
  • [37] Sample and feature selecting based ensemble learning for imbalanced problems
    Wang, Zhe
    Jia, Peng
    Xu, Xinlei
    Wang, Bolu
    Zhu, Yujin
    Li, Dongdong
    APPLIED SOFT COMPUTING, 2021, 113
  • [38] A Spark-Based Big Data Platform for Massive Remote Sensing Data Processing
    Sun, Zhongyi
    Chen, Fengke
    Chi, Mingmin
    Zhu, Yangyong
    DATA SCIENCE, 2015, 9208 : 120 - 126
  • [39] Attack Classification of Imbalanced Intrusion Data for IoT Network Using Ensemble-Learning-Based Deep Neural Network
    Thakkar, Ankit
    Lohiya, Ritika
    IEEE INTERNET OF THINGS JOURNAL, 2023, 10 (13) : 11888 - 11895
  • [40] Noise Avoidance SMOTE in Ensemble Learning for Imbalanced Data
    Kim, Kyoungok
    IEEE ACCESS, 2021, 9 : 143250 - 143265