Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling

Cited by: 0
Authors
Elaheh Gavagsaz
Ali Rezaee
Hamid Haj Seyyed Javadi
Affiliations
[1] Islamic Azad University, Department of Computer Engineering, Science and Research Branch
[2] Shahed University, Department of Applied Mathematics, Faculty of Mathematics and Computer Science
Source
The Journal of Supercomputing | 2018 / Vol. 74
Keywords
Load balancing; Data sampling; Data skew; MapReduce; Spark
Abstract
MapReduce has proven to be a highly efficient programming model for processing massive datasets on distributed systems. One of the most important obstacles to MapReduce performance is data skew, which leads to considerable load imbalance among the reducers and to performance degradation. This paper studies how to distribute skewed intermediate data efficiently so that the load on all reducers is evened out. A scalable sampling algorithm is used that obtains a more precise approximation of the key distribution by sampling only a small fraction of the intermediate data. The sample is then used to estimate the overall distribution of the keys. In addition, we propose a sorted-balance algorithm based on the sampling results: the sorted-balance algorithm using scalable simple random sampling (SBaSC). This work not only puts forward a load-balanced partitioning strategy but also proves a significant approximation ratio for SBaSC. The experiments confirm that our solution attains better execution time and better load balancing.
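The two steps named in the abstract, estimating the key distribution from a small sample and then balancing keys across reducers in sorted order, can be illustrated with a minimal sketch. This is not the paper's SBaSC algorithm, whose details the abstract does not give. It assumes plain simple random sampling without replacement via Python's `random.sample` and a standard greedy heuristic (heaviest key first, assigned to the currently least-loaded reducer). The function names `estimate_key_loads`, `sorted_balance`, and `reducer_loads` are hypothetical.

```python
import heapq
import random
from collections import Counter


def estimate_key_loads(keys, fraction, seed=0):
    """Estimate per-key record counts from a simple random sample.

    Assumption: sampling without replacement, scaled back up by the
    inverse of the sampling rate. Not the paper's sampling scheme.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(keys) * fraction))
    sample = rng.sample(keys, sample_size)
    scale = len(keys) / sample_size
    return {key: count * scale for key, count in Counter(sample).items()}


def sorted_balance(key_loads, num_reducers):
    """Assign keys to reducers greedily, heaviest estimated key first.

    Each key goes to the reducer with the smallest load so far,
    tracked with a min-heap of (load, reducer_id) pairs.
    """
    heap = [(0.0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    by_load_desc = sorted(key_loads.items(), key=lambda kv: kv[1], reverse=True)
    for key, load in by_load_desc:
        total, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment


def reducer_loads(assignment, key_loads, num_reducers):
    """Sum the estimated load that each reducer receives."""
    loads = [0.0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += key_loads[key]
    return loads
```

In this sketch, keys that never appear in the sample receive no assignment, so a real partitioner would need a fallback for them, for example hash partitioning.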
Pages: 3415-3440
Number of pages: 25