Improving MapReduce Performance with Partial Speculative Execution

Cited by: 0
Authors
Yaoguang Wang
Weiming Lu
Renjie Lou
Baogang Wei
Affiliation
[1] Zhejiang University, College of Computer Science
Source
Journal of Grid Computing | 2015 / Vol. 13
Keywords
Speculative execution; MapReduce performance; Straggler mitigation
DOI
Not available
Abstract
The MapReduce framework has become the de facto standard for big data processing, in part because it automatically parallelizes a job into multiple tasks and transparently handles task execution on a large cluster of commodity machines. The increasing heterogeneity of distributed environments can produce a few straggling tasks, which prolong job completion. Speculative execution was proposed to mitigate stragglers, but the existing mechanism works inefficiently because many speculative tasks are still slower than their original tasks. In this paper, we explore an approach that increases the efficiency of speculative execution and thereby improves MapReduce performance. We propose the Partial Speculative Execution (PSE) strategy, in which a speculative task starts from a checkpoint of the original task. By leveraging that checkpoint, PSE eliminates the cost of re-reading, re-copying, and re-computing data that has already been processed. We implement PSE in Hadoop and evaluate it in terms of job completion time and the efficiency of speculative execution under several classical workloads. Experimental results show that, in heterogeneous environments with stragglers, PSE completes jobs 56% faster than execution with no speculation and 12% faster than LATE, an improved speculative execution algorithm. In addition, PSE improves the efficiency of speculative execution by 24% on average compared to LATE.
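The checkpoint idea in the abstract can be illustrated with a toy word count. This is a minimal sketch only, not the authors' Hadoop implementation: the `Checkpoint` class, the `run_task` function, and the `stop_at` parameter are all hypothetical names introduced here, and a real system would also need to persist and transfer the checkpoint between nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical checkpoint: records consumed so far plus partial output."""
    offset: int = 0
    partial_counts: dict = field(default_factory=dict)


def run_task(records, start=None, stop_at=None):
    """Count words in records from the start offset up to stop_at.

    `start` lets a task resume from an earlier checkpoint; `stop_at` models a
    straggler that only gets part of the way through its input split.
    """
    ckpt = Checkpoint(
        offset=start.offset if start else 0,
        partial_counts=dict(start.partial_counts) if start else {},
    )
    end = len(records) if stop_at is None else stop_at
    for line in records[ckpt.offset:end]:
        for word in line.split():
            ckpt.partial_counts[word] = ckpt.partial_counts.get(word, 0) + 1
        ckpt.offset += 1
    return ckpt


records = ["a b", "b c", "c d", "d a"]

# The original task straggles after processing 3 of 4 records.
straggler = run_task(records, stop_at=3)

# Full speculation would redo all 4 records; partial speculation resumes
# from the straggler's checkpoint and processes only the remaining record.
speculative = run_task(records, start=straggler)
records_reprocessed = speculative.offset - straggler.offset

# Reference run without any straggler, used to check the resumed result.
baseline = run_task(records)
```

In this sketch the resumed task touches only the one record the straggler had not reached, yet produces the same counts as an uninterrupted run, which is the saving the abstract attributes to avoiding re-reading and re-computing processed data.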
Pages: 587-604
Number of pages: 17