Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques

Authors
Bergui, Mohammed [1]
Hourri, Soufiane [1, 2]
Najah, Said [1]
Nikolov, Nikola S. [3]
Affiliations
[1] Univ Sidi Mohammed Ben Abdellah, Fac Sci & Technol, Dept Comp Sci, Lab Intelligent Syst & Applicat, Fes, Morocco
[2] Univ Cadi Ayyad, Higher Sch Technol, Lab Proc Ind Signals & Comp Sci, Safi, Morocco
[3] Univ Limerick, Dept Comp Sci & Informat Syst, Limerick, Ireland
Keywords
Hadoop; MapReduce; Big data; Performance modelling; Runtime prediction; Machine learning
DOI
10.1186/s40537-024-00964-z
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.
Pages: 20