An intermediate data placement algorithm for load balancing in Spark computing environment

Cited by: 47
Authors
Tang, Zhuo [1 ]
Zhang, Xiangshen [1 ]
Li, Kenli [1 ]
Li, Keqin [1 ,2 ]
Affiliations
[1] Hunan Univ, Coll Informat Sci & Engn, Changsha 410082, Hunan, Peoples R China
[2] SUNY Coll New Paltz, Dept Comp Sci, New Paltz, NY 12561 USA
Source
FUTURE GENERATION COMPUTER SYSTEMS: THE INTERNATIONAL JOURNAL OF ESCIENCE | 2018, Vol. 78
Funding
National Natural Science Foundation of China
Keywords
Data sampling; Data skew; Load balancing; MapReduce; Spark; IMPROVING MAPREDUCE PERFORMANCE;
DOI
10.1016/j.future.2016.06.027
Chinese Library Classification
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Since MapReduce became an effective and popular programming framework for parallel data processing, key skew in intermediate data has become one of the major system performance bottlenecks. To address the load imbalance of bucket containers in the shuffle phase of the Spark computing framework, this paper proposes a splitting and combination algorithm for skewed intermediate data blocks (SCID), which improves load balancing across reduce tasks. Because the key distribution cannot be determined until the input data have been processed by the map tasks, this paper presents a sampling algorithm based on reservoir sampling to estimate the distribution of keys in the intermediate data. In contrast to the original mechanism for bucket data loading, SCID sorts the clusters of key/value tuples produced by each map task according to their sizes and fills them into the buckets in order. A data cluster is split once it exceeds the residual capacity of the current bucket; after that bucket is filled, the remaining part of the cluster carries over to the next iteration. Through this process, the total size of the data in each bucket is scheduled to be roughly equal. Since each reduce task fetches its intermediate results from a specific bucket of every map task, equalizing the bucket sizes for each map task balances the load across the reduce tasks. We implement SCID in Spark 1.1.0 and evaluate its performance with three widely used benchmarks: Sort, Text Search, and Word Count. Experimental results show that our algorithm not only achieves better overall load balancing but also reduces job execution time under varying degrees of data skew. (C) 2016 Elsevier B.V. All rights reserved.
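The two mechanisms the abstract describes — reservoir sampling to estimate the key distribution, and the sorted fill-and-split placement of key clusters into buckets — can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the paper's Spark implementation: the function names are illustrative, cluster sizes are abstracted to numbers, and a uniform bucket capacity of `total / num_buckets` is assumed.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown
    length (classic reservoir sampling): the i-th item replaces a random
    slot with probability k/(i+1)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

def scid_fill(cluster_sizes, num_buckets):
    """Sorted fill-and-split placement (SCID-style sketch): sort clusters by
    size in descending order and pour each into the current bucket, splitting
    any cluster that exceeds the bucket's residual capacity; the remainder
    carries over into the next bucket. Returns the resulting bucket loads."""
    total = sum(cluster_sizes)
    capacity = total / num_buckets          # target load per bucket
    loads = [0.0] * num_buckets
    b = 0
    for size in sorted(cluster_sizes, reverse=True):
        remaining = size
        while remaining > 1e-9:
            if b == num_buckets - 1:
                take = remaining            # last bucket absorbs the rest
            else:
                take = min(remaining, capacity - loads[b])
            loads[b] += take
            remaining -= take
            if b < num_buckets - 1 and loads[b] >= capacity:
                b += 1                      # bucket full: move to the next one
    return loads

# Skewed cluster sizes split evenly across 3 buckets:
print(scid_fill([9, 7, 5, 3, 2, 2, 1, 1], 3))  # → [10.0, 10.0, 10.0]
```

With this greedy split, every bucket's load lands at the target capacity regardless of how skewed the individual cluster sizes are, which is the property the reduce-side balancing relies on.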
Pages: 287-301 (15 pages)