Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling

被引:0
作者
Elaheh Gavagsaz
Ali Rezaee
Hamid Haj Seyyed Javadi
机构
[1] Islamic Azad University,Department of Computer Engineering, Science and Research Branch
[2] Shahed University,Department of Applied Mathematics, Faculty of Mathematics and Computer Science
来源
The Journal of Supercomputing | 2018年 / 74卷
关键词
Load balancing; Data sampling; Data skew; MapReduce; Spark;
D O I
暂无
中图分类号
学科分类号
摘要
MapReduce has demonstrated itself to be as a highly efficient programming model for processing massive dataset on the distributed system. One of the most important obstacles hindering the performance of MapReduce is data skewness. The presence of data skewness leads to considerable load imbalance on the reducers and performance degradation. In this paper, the problem of how to efficiently accommodate intermediate data to even up the load of all reducers is studied when encountering skewed data. A scalable sampling algorithm is used which it can observe a more precise approximate distribution of the keys by sampling only a small fraction of the intermediate data. Afterwards, it is applied to evaluate the overall distribution of the keys. In addition, we propose a sorted-balance algorithm based on sampling results: sorted-balance algorithm using scalable simple random sampling (SBaSC). This work not only puts forward a load-balanced partitioning strategy, but also proves a significant approximation ratio of SBaSC. The experiments confirm that our solution attains a better execution time and load balancing results.
引用
收藏
页码:3415 / 3440
页数:25
相关论文
共 50 条
[11]  
Khalil I(2018)An intermediate data placement algorithm for load balancing in Spark computing environment Future Gener Comput Syst 11 37-57
[12]  
Zomaya AY(1985)Random sampling with a reservoir ACM Trans Math Softw 26 261-268
[13]  
Foufou S(1977)List sequential sampling with equal or unequal probabilities without placement J R Stat Soc Ser C (Appl Stat) 7 448-461
[14]  
Bouras A(1973)Time bounds for selection J Comput Syst Sci 17 416-429
[15]  
Lee I(1969)Bounds on multiprocessing timing anomalies SIAM J Appl Math 313 1200-undefined
[16]  
Vaidya M(1996)Statistics notes: detecting skewness from summary information BMJ undefined undefined-undefined
[17]  
Xu Y(undefined)undefined undefined undefined undefined-undefined
[18]  
Qu W(undefined)undefined undefined undefined undefined-undefined
[19]  
Li Z(undefined)undefined undefined undefined undefined-undefined
[20]  
Liu Z(undefined)undefined undefined undefined undefined-undefined