Multi-file Queries Performance Improvement through Data Placement in Hadoop

Cited by: 0
Authors
Tang, Yu [1]
Abdulhay, Elham [1]
Fan, Aihua
Su, Sheng [1]
Gebreselassie, Kidus [1]
Affiliation
[1] University of Electronic Science and Technology of China, Chengdu 611731, People's Republic of China
Source
PROCEEDINGS OF 2012 2ND INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT 2012) | 2012
Keywords
HDFS; Block Placement; Data locality; Correlation;
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Hadoop is popular for processing data-intensive jobs because of its data locality feature. However, the performance gained from this feature is currently limited by Hadoop's default block placement policy, which implicitly assumes that instances of MapReduce jobs access data from a single file. In contrast, multi-file queries such as indexing or aggregation queries need to process related data from more than one file, stored on different DataNodes of a cluster. In this paper we propose a Correlation-based Block Placement (CBP) algorithm that enhances the performance of these queries by placing related blocks on the same set of DataNodes. Furthermore, we developed a customized InputFormat that enables InputSplits to contain records from different files. Simulation results demonstrated that the number of migrating data blocks for CBP was insignificant. Under the default policy, by contrast, the number of migrating data blocks increased significantly with the input dataset size. As a result, for every input dataset size, the performance of CBP exceeded that of the default policy.
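The idea in the abstract, placing related blocks on the same set of DataNodes so that fewer blocks have to migrate, can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the paper's CBP algorithm, which the abstract does not spell out: the function names, the notion of a "correlation group", the round-robin choice of node sets, and the migration count are all hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def default_placement(blocks, nodes, replicas, rng):
    """Toy stand-in for a correlation-unaware policy: each block gets
    `replicas` nodes picked at random, independently of other blocks."""
    return {b: set(rng.sample(nodes, replicas)) for b in blocks}


def correlated_placement(blocks, group_of, nodes, replicas):
    """Hypothetical correlation-aware policy: every block of a
    correlation group is stored on the same set of nodes."""
    group_nodes = {}
    placement = {}
    for b in blocks:
        g = group_of[b]
        if g not in group_nodes:
            # Assumption: spread groups over the cluster round-robin.
            start = (len(group_nodes) * replicas) % len(nodes)
            group_nodes[g] = {
                nodes[(start + i) % len(nodes)] for i in range(replicas)
            }
        placement[b] = set(group_nodes[g])
    return placement


def blocks_to_migrate(placement, group_of):
    """Count blocks that must be fetched remotely if each group is
    processed on the node already holding most of its blocks."""
    members = defaultdict(list)
    for b, g in group_of.items():
        members[g].append(b)
    total = 0
    for blocks_in_group in members.values():
        local = defaultdict(int)
        for b in blocks_in_group:
            for n in placement[b]:
                local[n] += 1
        total += len(blocks_in_group) - max(local.values())
    return total


# Two files, A and B; block i of A is assumed to correlate with block i of B.
nodes = [f"dn{i}" for i in range(6)]
blocks = [(f, i) for f in ("A", "B") for i in range(4)]
group_of = {b: b[1] for b in blocks}

cbp = correlated_placement(blocks, group_of, nodes, replicas=3)
dflt = default_placement(blocks, nodes, replicas=3, rng=random.Random(0))
print("correlated:", blocks_to_migrate(cbp, group_of))
print("default:   ", blocks_to_migrate(dflt, group_of))
```

In this toy model the correlated placement never needs a remote fetch, because all blocks of a group share a node set, whereas the random placement may or may not, depending on how the replicas happen to overlap.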
Pages: 986-991
Number of pages: 6