A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

Cited by: 52
Authors
Ahmed, N. [1]
Barczak, Andre L. C. [1]
Susnjak, Teo [1]
Rashid, Mohammed A. [2]
Affiliations
[1] Massey Univ, Sch Nat & Computat Sci, Auckland 0745, New Zealand
[2] Massey Univ, Dept Mech & Elect Engn, Auckland 0745, New Zealand
Keywords
HiBench; BigData; Hadoop; MapReduce; Benchmark; Spark
DOI
10.1186/s40537-020-00388-5
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for industry. Distributed computing frameworks such as Hadoop and Spark offer efficient solutions for analyzing vast amounts of data. Owing to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default parameters let system administrators deploy applications without much effort and measure the performance of their specific cluster with factory-set values. However, an open question remains: can new parameter selection improve cluster performance for large datasets? This study therefore investigates the most impactful parameters, grouped under resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark on a cluster implemented in our laboratory. We tuned these parameters by trial and error over a large number of experiments. For the comparative analysis, we selected two workloads: WordCount and TeraSort. Performance is evaluated on three criteria: execution time, throughput, and speedup. Our experimental results show that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis shows that Spark outperforms Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
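The three criteria named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so the definitions here are assumptions: throughput is taken as input size divided by execution time, and speedup as the ratio of a baseline execution time to a candidate execution time. The function names and the sample timings are hypothetical and are not results from the paper.

```python
def throughput_mb_per_s(input_size_mb: float, execution_time_s: float) -> float:
    """Assumed definition: amount of input processed per second of run time."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return input_size_mb / execution_time_s


def speedup(baseline_time_s: float, candidate_time_s: float) -> float:
    """Assumed definition: how many times faster the candidate is than the baseline."""
    if baseline_time_s <= 0 or candidate_time_s <= 0:
        raise ValueError("execution times must be positive")
    return baseline_time_s / candidate_time_s


# Hypothetical timings for one workload on the same input (not measured data).
input_size_mb = 1000.0
hadoop_time_s = 100.0
spark_time_s = 50.0

hadoop_throughput = throughput_mb_per_s(input_size_mb, hadoop_time_s)
spark_throughput = throughput_mb_per_s(input_size_mb, spark_time_s)
spark_speedup = speedup(hadoop_time_s, spark_time_s)
```

Under these assumed definitions, a speedup above 1 means the candidate framework finished the same workload faster than the baseline.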
Pages: 18