Performance optimization for short job execution in Hadoop MapReduce

Cited by: 0
Authors
Gu, Rong [1]
Yan, Jinshuang [1]
Yang, Xiaoliang [1]
Yuan, Chunfeng [1]
Huang, Yihua [1]
Affiliation
[1] State Key Laboratory for Novel Software Technology (Nanjing University)
Source
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2014 / Vol. 51 / No. 06
Keywords
Big data processing; MapReduce; Parallel computing; Performance optimization; Short job
DOI
10.7544/issn1000-1239.2014.20130130
Abstract
Hadoop MapReduce is a widely used parallel computing framework for solving data-intensive problems. Because it handles large-scale data well, Hadoop MapReduce has also been adopted in many query applications. To process large-scale datasets, the fundamental design of standard Hadoop places more emphasis on high data throughput than on job execution performance. This limits performance when Hadoop MapReduce is used to execute short jobs. This paper proposes several optimization methods to improve the execution performance of MapReduce jobs, especially short jobs. We make three major optimizations: 1) reduce the time cost of the initialization and termination stages of a job by optimizing its setup and cleanup tasks; 2) change the assignment model for the first batch of tasks from the pull model to the push model; 3) replace the heartbeat-based communication mechanism with an instant-message communication mechanism for event notifications between the JobTracker and TaskTrackers. We also adopt a typical MapReduce-based parallel query application, BLAST, to evaluate the effects of our optimizations. Experimental results show that job execution with our improved version of Hadoop is about 23% faster on average than with standard Hadoop for different types of BLAST MapReduce jobs.
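
The third optimization can be illustrated with a toy timing model. The sketch below is not the paper's implementation and does not use any Hadoop API. It only assumes that, under heartbeat-based communication, an event is reported at the next periodic heartbeat, while under instant messaging it is reported after a fixed network latency. The function names, the 3-second interval, the 10 ms latency, and the event times are all hypothetical values chosen for illustration.

```python
import math


def heartbeat_delay(event_time: float, interval: float) -> float:
    """Delay until the first heartbeat tick at or after event_time.

    Assumes (hypothetically) that heartbeats fire at 0, interval, 2*interval, ...
    """
    next_tick = math.ceil(event_time / interval) * interval
    return next_tick - event_time


def instant_delay(network_latency: float) -> float:
    """Delay when the event is pushed immediately: only the network latency."""
    return network_latency


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Illustrative values only; none of these numbers come from the paper.
INTERVAL = 3.0          # assumed heartbeat period, in seconds
LATENCY = 0.01          # assumed one-way message latency, in seconds
event_times = [0.5, 1.0, 2.5, 4.0, 7.25]  # hypothetical task-completion times

heartbeat_delays = [heartbeat_delay(t, INTERVAL) for t in event_times]
instant_delays = [instant_delay(LATENCY) for _ in event_times]

print("mean heartbeat delay:", mean(heartbeat_delays))
print("mean instant delay:  ", mean(instant_delays))
```

In this model the heartbeat delay is bounded by the interval and does not depend on job length, which is why such fixed waits weigh more heavily on short jobs than on long ones.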
Pages: 1270-1280
Page count: 10