Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques

Cited by: 0

Authors:

Bergui, Mohammed ^{[1]}

Hourri, Soufiane ^{[1,2]}

Najah, Said ^{[1]}

Nikolov, Nikola S. ^{[3]}

Affiliations:

[1] Univ Sidi Mohammed Ben Abdellah, Fac Sci & Technol, Dept Comp Sci, Lab Intelligent Syst & Applicat, Fes, Morocco

[2] Univ Cadi Ayyad, Higher Sch Technol, Lab Proc Ind Signals & Comp Sci, Safi, Morocco

[3] Univ Limerick, Dept Comp Sci & Informat Syst, Limerick, Ireland

Source:

JOURNAL OF BIG DATA | 2024 / Volume 11 / Issue 01

Keywords:

Hadoop; MapReduce; Big data; Performance modelling; Runtime prediction; Machine learning;

DOI:

10.1186/s40537-024-00964-z

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.


Pages: 20
