Equi-join for Multiple Datasets Based on Time Cost Evaluation Model

Cited by: 0
Authors
Zhu, Hong [1]
Xia, Libo [1]
Xie, Mieyi [1]
Yan, Ke [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430074, Hubei, Peoples R China
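Source
ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2014, PT II | 2014 / Vol. 8631
Keywords
Join; MapReduce; Dynamic Programming
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract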
MapReduce is an important programming model for processing big data with parallel, distributed algorithms on a cluster. In big data analytics applications, equi-join is an important operation. However, performing an equi-join in MapReduce is inefficient when multiple datasets are involved in the join. In this paper, a time cost evaluation model for equi-joins is extended to account for the time cost of calculation. In addition, the sub-joins in an equi-join are classified into star pattern sub-joins on a single attribute and chain pattern sub-joins. Based on the extended model, optimization methods are presented, and an equi-join plan with lower time cost is chosen. The optimization methods proceed in three steps: first, the star pattern sub-joins on a single attribute are processed; next, the chain pattern sub-join with the minimal scale of intermediate results (i.e., the number of tuples in the intermediate results) is processed; finally, each chain pattern sub-join is decomposed by dynamic programming into either several MapReduce jobs or a single MapReduce job to obtain an optimal scheme for that sub-join. We conducted extensive experiments, and the results show that our method is more efficient than existing methods such as MDMJ, Hive, and Pig.
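
The dynamic-programming step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's cost model, which the abstract does not specify. It assumes a hypothetical simplified cost, the total number of intermediate tuples, and hypothetical inputs: `sizes` (tuple counts per dataset) and `selectivities` (estimated join selectivity between adjacent datasets in the chain, treated as independent). The function name `plan_chain_join` is likewise an illustration only.

```python
from functools import lru_cache
from math import prod


def plan_chain_join(sizes, selectivities):
    """Pick a join order for the chain R1 JOIN R2 JOIN ... JOIN Rn.

    sizes[i]         -- assumed number of tuples in dataset R(i+1)
    selectivities[i] -- assumed selectivity between R(i+1) and R(i+2)
    Returns (estimated_cost, plan_string). The cost is the sum of the
    estimated sizes of all join results produced by the plan, a
    stand-in for a real time cost model.
    """
    n = len(sizes)
    if len(selectivities) != n - 1:
        raise ValueError("need exactly one selectivity per adjacent pair")

    def result_size(i, j):
        # Estimated tuples in R(i+1) JOIN ... JOIN R(j+1),
        # assuming the selectivities are independent.
        return prod(sizes[i:j + 1]) * prod(selectivities[i:j])

    @lru_cache(maxsize=None)
    def best(i, j):
        # A single dataset needs no join and produces no intermediate result.
        if i == j:
            return 0.0, f"R{i + 1}"
        candidates = []
        for k in range(i, j):  # try every split point of the sub-chain
            left_cost, left_plan = best(i, k)
            right_cost, right_plan = best(k + 1, j)
            cost = left_cost + right_cost + result_size(i, j)
            candidates.append((cost, f"({left_plan} JOIN {right_plan})"))
        return min(candidates)

    return best(0, n - 1)


if __name__ == "__main__":
    cost, plan = plan_chain_join([1000, 100, 10], [0.01, 0.1])
    print(plan, cost)
```

With these example inputs the sketch prefers joining R2 and R3 first, because that pair yields the smaller intermediate result. A real planner would replace the tuple-count cost with a time cost that covers I/O, network transfer, and calculation, and would also decide how many MapReduce jobs each sub-chain uses.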
Pages: 122-135
Number of pages: 14