Handling partitioning skew in MapReduce using LEEN

Cited by: 0
Authors
Shadi Ibrahim
Hai Jin
Lu Lu
Bingsheng He
Gabriel Antoniu
Song Wu
Affiliations
[1] INRIA Rennes-Bretagne Atlantique, Cluster and Grid Computing Lab, Services Computing Technology and System Lab
[2] Huazhong University of Science and Technology, School of Computer Engineering
[3] Nanyang Technological University
Source
Peer-to-Peer Networking and Applications | 2013 / Vol. 6
Keywords
MapReduce; Hadoop; Cloud computing; Skew partitioning; Intermediate data
DOI
Not available
Abstract
MapReduce is emerging as a prominent tool for big data processing. Data locality is a key feature in MapReduce that is extensively leveraged in data-intensive cloud systems: it avoids network saturation when processing large amounts of data by co-allocating computation and data storage, particularly for the map phase. However, our studies with Hadoop, a widely used MapReduce implementation, demonstrate that the presence of partitioning skew (that is, variation in the intermediate keys' frequencies, in their distribution among different data nodes, or in both) causes a huge amount of data transfer during the shuffle phase and leads to significant unfairness in the reduce input among different data nodes. As a result, applications experience severe performance degradation due to the long data transfer during the shuffle phase along with the computation skew, particularly in the reduce phase. In this paper, we develop a novel algorithm named LEEN for locality-aware and fairness-aware key partitioning in MapReduce. LEEN embraces an asynchronous map and reduce scheme. All buffered intermediate keys are partitioned according to their frequencies and the fairness of the expected data distribution after the shuffle phase. We have integrated LEEN into Hadoop. Our experiments demonstrate that LEEN can efficiently achieve higher locality and reduce the amount of shuffled data. More importantly, LEEN guarantees fair distribution of the reduce inputs. As a result, LEEN achieves a performance improvement of up to 45% on different workloads.
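The two quantities the abstract weighs against each other, the volume of shuffled data (locality) and the spread of reduce inputs across nodes (fairness), can be made concrete with a small sketch. The code below is not the LEEN algorithm as published; it is an illustrative greedy heuristic under stated assumptions. The frequency table `FREQ`, the function names, the `weight` parameter, and the round-robin stand-in for hash partitioning are all hypothetical and chosen only for this example.

```python
from statistics import pstdev

# Hypothetical input: FREQ[key][node] is the number of intermediate records
# for `key` produced by map tasks on data node `node`.
FREQ = {
    "a": [30, 2, 1],
    "b": [1, 12, 2],
    "c": [2, 1, 12],
    "d": [3, 9, 3],
    "e": [2, 2, 8],
}


def shuffled_volume(freq, assignment):
    """Count records that must cross the network to reach their reduce node."""
    return sum(
        count
        for key, per_node in freq.items()
        for node, count in enumerate(per_node)
        if node != assignment[key]
    )


def reduce_inputs(freq, assignment, num_nodes):
    """Total records each node receives as reduce input."""
    loads = [0] * num_nodes
    for key, per_node in freq.items():
        loads[assignment[key]] += sum(per_node)
    return loads


def round_robin_assignment(freq, num_nodes):
    """Deterministic stand-in for hash partitioning (ignores frequencies)."""
    return {key: i % num_nodes for i, key in enumerate(sorted(freq))}


def greedy_assignment(freq, num_nodes, weight=1.0):
    """Assumed heuristic, not the paper's method: place keys largest first on
    the node minimizing (remote records + weight * std-dev of reduce inputs)."""
    loads = [0] * num_nodes
    assignment = {}
    for key in sorted(freq, key=lambda k: -sum(freq[k])):
        total = sum(freq[key])

        def cost(node):
            trial = loads.copy()
            trial[node] += total
            remote = total - freq[key][node]  # records not already on `node`
            return remote + weight * pstdev(trial)

        best = min(range(num_nodes), key=cost)
        assignment[key] = best
        loads[best] += total
    return assignment


if __name__ == "__main__":
    for name, plan in [
        ("round-robin", round_robin_assignment(FREQ, 3)),
        ("greedy", greedy_assignment(FREQ, 3)),
    ]:
        loads = reduce_inputs(FREQ, plan, 3)
        print(name, shuffled_volume(FREQ, plan), loads, round(pstdev(loads), 2))
```

On this toy table the frequency-aware plan shuffles fewer records and yields more even reduce inputs than the frequency-blind plan. The single `weight` knob is a simplification for illustration; how LEEN itself balances the two objectives is described in the paper, not here.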
Pages: 409-424
Number of pages: 15