MRSIM: Mitigating Reducer Skew In MapReduce

Cited by: 8
Authors
Chen, Lei [1]
Lu, Wei [1]
Che, Xiaoping [1]
Xing, Weiwei [1]
Wang, Liqiang [2]
Yang, Yong [1]
Affiliations
[1] Beijing Jiaotong Univ, Sch Software Engn, Beijing, Peoples R China
[2] Univ Cent Florida, Dept Comp Sci, Orlando, FL 32816 USA
Source
2017 31ST IEEE INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS WORKSHOPS (IEEE WAINA 2017) | 2017
Funding
National Natural Science Foundation of China; U.S. National Science Foundation
Keywords
DOI
10.1109/WAINA.2017.94
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
MapReduce has emerged as a popular programming model for data-intensive computing, owing to its simple design, which makes it easy for programmers to use, and to framework implementations such as Hadoop, which have been adopted by large business and technology companies. One significant issue in practical MapReduce applications is data skew: an imbalance in the amount of data assigned to each task. Skew causes some tasks to take much longer to finish than others and can significantly degrade performance. Existing solutions for reduce-side data skew add overhead: users must either write a custom partitioner for each specific application or run an additional sampling pass before the map function begins. To mitigate reduce-side data skew, which this paper calls Reducer skew, we propose MRSIM, a load balancing strategy based on load statistics. To obtain the input data distribution of the reduce stage, MRSIM computes the statistics while the data is being prepared, making full use of the shuffle stage in MapReduce. To balance the load across the entire cluster, MRSIM reallocates reduce tasks from heavily loaded nodes to idle ones according to the data distribution. In addition, by introducing a load feedback mechanism, MRSIM further improves the cluster's performance when running complex applications. We evaluated MRSIM on YARN (Hadoop 2.2.0). The experimental results show that MRSIM substantially outperforms the default strategy in native Hadoop, with the improvement in execution time reaching 17%.
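The reallocation idea in the abstract can be illustrated with a minimal sketch. It is not MRSIM's actual algorithm, which the abstract does not specify. It assumes that per-partition sizes are already known (for example, from statistics gathered during shuffle) and applies a generic greedy largest-first heuristic. The names `greedy_assign` and `node_loads` and the sample sizes are hypothetical.

```python
import heapq

def greedy_assign(partition_sizes, num_nodes):
    """Assign partitions to nodes, largest first, each to the least-loaded node.

    Generic load-balancing heuristic used for illustration only; it is an
    assumption, not the strategy described in the MRSIM paper.
    """
    heap = [(0, node) for node in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    for pid, size in sorted(partition_sizes.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        assignment[pid] = node
        heapq.heappush(heap, (load + size, node))
    return assignment

def node_loads(partition_sizes, assignment, num_nodes):
    """Total data volume that each node would process under an assignment."""
    loads = [0] * num_nodes
    for pid, node in assignment.items():
        loads[node] += partition_sizes[pid]
    return loads

# Hypothetical reduce-partition sizes (arbitrary units) showing skew.
sizes = {0: 90, 1: 10, 2: 50, 3: 40, 4: 10, 5: 20}

# Size-oblivious baseline: partition i goes to node i mod 2.
round_robin = {pid: pid % 2 for pid in sizes}
balanced = greedy_assign(sizes, 2)

print(node_loads(sizes, round_robin, 2))  # [150, 70]
print(node_loads(sizes, balanced, 2))     # [110, 110]
```

In this toy input the size-oblivious placement leaves one node with more than twice the data of the other, while the size-aware placement evens the load, which is the kind of imbalance a statistics-driven strategy aims to remove.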
Pages: 379-384
Page count: 6