Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models

Cited by: 0
Authors
Nasim Ahmed
Andre L. C. Barczak
Mohammad A. Rashid
Teo Susnjak
Affiliations
[1] Massey University, School of Mathematical and Computational Sciences
[2] Massey University, Department of Mechanical and Electrical Engineering
Source
Journal of Big Data, Volume 9
Keywords
Big data; Performance prediction; Machine learning; System configuration; HiBench; Apache Spark; Extrapolation and interpolation
DOI
Not available
Abstract
Due to the rapid growth of available data, various platforms offer parallel infrastructure that processes big data efficiently. A critical issue is how to use these platforms to optimise resources, and for this reason performance prediction has been an important topic in the last few years. There are two main approaches to predicting performance. One is to fit data to an equation derived from an analytical model. The other is to use machine learning (ML) in the form of regression algorithms. In this paper, we investigate the difference in accuracy between these two approaches. While our experiments used an open-source platform called Apache Spark, the results are applicable to any parallel platform and are not constrained to this technology. We found that gradient boost, an ML regressor, is more accurate than any of the existing analytical models as long as the prediction range stays within the training range. We investigated analytical and ML models under interpolation and extrapolation, using k-fold cross-validation. Under interpolation, two analytical models, namely the 2D-plate and fully-connected models, outperform older analytical models and the kernel ridge regression algorithm, but not the gradient boost regression algorithm. The average accuracies of the 2D-plate and fully-connected models under interpolation are 0.962 and 0.961, respectively. Under extrapolation, however, the analytical models are much more accurate than the ML regressors, particularly the two most recently proposed models (2D-plate and fully-connected), both of which are based on the communication patterns between the nodes. Under extrapolation, the average accuracies of kernel ridge, gradient boost and the two proposed analytical models are 0.466, 0.677, 0.975 and 0.981, respectively. This study shows that practitioners can benefit from analytical models, which can accurately predict runtime outside the range of the training data using only a few experimental operations.
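The gap between interpolation and extrapolation described in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the runtime curve `true_runtime`, the two-parameter model `t(n) = a + b/n`, the nearest-neighbour lookup (a crude stand-in for the piecewise-constant predictions of tree ensembles), and the relative-accuracy measure are all hypothetical. None of it reproduces the paper's 2D-plate or fully-connected models, its data, or its accuracy metric.

```python
# Illustrative sketch only: synthetic data and hypothetical models,
# not the models, data, or metric used in the paper.

def true_runtime(n):
    """Hypothetical runtime (seconds) on n executors: fixed cost plus parallel work."""
    return 20.0 + 600.0 / n

def fit_inverse_model(ns, ts):
    """Least-squares fit of the assumed analytical form t(n) = a + b / n."""
    xs = [1.0 / n for n in ns]
    mean_x = sum(xs) / len(xs)
    mean_t = sum(ts) / len(ts)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    b = sxt / sxx
    a = mean_t - b * mean_x
    return lambda n: a + b / n

def fit_nearest_neighbour(ns, ts):
    """Return the runtime of the closest training point (piecewise-constant predictor)."""
    pairs = list(zip(ns, ts))
    return lambda n: min(pairs, key=lambda p: abs(p[0] - n))[1]

def accuracy(pred, actual):
    """Assumed accuracy measure: one minus the relative error."""
    return 1.0 - abs(pred - actual) / actual

train_ns = [2, 4, 6, 8]
train_ts = [true_runtime(n) for n in train_ns]
analytical = fit_inverse_model(train_ns, train_ts)
lookup = fit_nearest_neighbour(train_ns, train_ts)

# n = 5 lies inside the training range; n = 16 lies outside it.
for label, n in [("interpolation", 5), ("extrapolation", 16)]:
    actual = true_runtime(n)
    print(label, n,
          round(accuracy(analytical(n), actual), 3),
          round(accuracy(lookup(n), actual), 3))
```

Outside the training range the lookup can only repeat the runtime of the nearest training point, while the parametric form keeps following the assumed curve. This is one plausible reading of why a model with the right functional shape extrapolates better. The sketch is noise-free and the fitted form matches the generating curve exactly, so it overstates the advantage a real analytical model would have.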