An Uncoupled Data Process and Transfer Model for MapReduce

Cited by: 1
Authors
Zha, Li [1]
Zhang, Jie [1,2]
Liu, Wei [1,2]
Lin, Jian [1,2]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Source
TRANSACTIONS ON LARGE-SCALE DATA- AND KNOWLEDGE-CENTERED SYSTEMS XVII | 2015 / Vol. 8970
Keywords
MapReduce; Data transfer; Uncoupled model; Compression
DOI
10.1007/978-3-662-46335-2_2
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In the original MapReduce model, reduce tasks fetch the output of map tasks by "pull". However, reduce tasks that already occupy reduce slots cannot start executing until all the corresponding map tasks have completed. This dependence between map and reduce tasks is called the coupled relationship in this paper. The coupled relationship leads to two problems: reduce slot hoarding and underutilized network bandwidth. In addition, storing the result data is costly, especially when the system keeps replicas, which leads to an inefficient storage problem. To address these problems, we propose an uncoupled data process and transfer model. Four core techniques (weighted mapping, data pushing, partial data backup, and data compression) are introduced and applied in Apache Hadoop, the mainstream open-source implementation of the MapReduce model. This work has been put into practice at Baidu, the largest search engine company in China. A real-world web data processing application shows that our model improves system throughput by 29.5%, reduces total wall time by 22.8%, provides a weighted wall time acceleration of 26.3%, and reduces the result data stored on disk by 70%. Moreover, the implementation of this model is transparent to users and compatible with the original Hadoop.
Pages: 24-44
Page count: 21