Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques

Cited by: 0
Authors
Bergui, Mohammed [1]
Hourri, Soufiane [1, 2]
Najah, Said [1]
Nikolov, Nikola S. [3]
Affiliations
[1] Univ Sidi Mohammed Ben Abdellah, Fac Sci & Technol, Dept Comp Sci, Lab Intelligent Syst & Applicat, Fes, Morocco
[2] Univ Cadi Ayyad, Higher Sch Technol, Lab Proc Ind Signals & Comp Sci, Safi, Morocco
[3] Univ Limerick, Dept Comp Sci & Informat Syst, Limerick, Ireland
Keywords
Hadoop; MapReduce; Big data; Performance modelling; Runtime prediction; Machine learning
DOI
10.1186/s40537-024-00964-z
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.
Pages: 20