Improving the performance of Hadoop Hive by sharing scan and computation tasks

Cited by: 12
Authors
Dokeroglu T. [1]
Ozal S. [1]
Bayir M.A. [2]
Cinar M.S. [3]
Cosar A. [1]
Affiliations
[1] Middle East Technical University Computer Engineering Department, Cankaya, Ankara
[2] Microsoft Research Redmond, One Microsoft Way, Redmond
[3] Hacettepe University Computer Engineering Department, Cankaya, Ankara
Keywords
Data warehouse; Hadoop; Hive; Multiple-query optimization
DOI
10.1186/s13677-014-0012-6
Abstract
MapReduce is a popular programming model for executing time-consuming analytical queries as a batch of tasks on large scale data clusters. In environments where multiple queries with similar selection predicates, common tables, and join tasks arrive simultaneously, many opportunities can arise for sharing scan and/or join computation tasks. Executing common tasks only once can remarkably reduce the total execution time of a batch of queries. In this study, we propose a Multiple Query Optimization framework, SharedHive, to improve the overall performance of Hadoop Hive, an open source SQL-based data warehouse using MapReduce. SharedHive transforms a set of correlated HiveQL queries into a new set of insert queries that will produce all of the required outputs within a shorter execution time. It is experimentally shown that SharedHive achieves significant reductions in total execution times of TPC-H queries. © 2014, Dokeroglu et al.; licensee Springer.
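The idea of rewriting correlated queries into insert queries that share a scan can be illustrated with Hive's multi-insert form (`FROM ... INSERT OVERWRITE TABLE ... SELECT ...`), in which one pass over a source table feeds several outputs. The sketch below is not SharedHive's algorithm: it only groups queries by a common source table and emits one multi-insert statement per table. The function name `merge_shared_scans`, the dictionary keys, and the output table names are hypothetical; the column names follow the TPC-H schema.

```python
from collections import OrderedDict


def merge_shared_scans(queries):
    """Group queries by source table and emit one Hive multi-insert per table.

    Each query is a dict with hypothetical keys:
    'source', 'target', 'columns', and an optional 'where'.
    """
    # Preserve arrival order while bucketing queries by the table they scan.
    by_source = OrderedDict()
    for q in queries:
        by_source.setdefault(q["source"], []).append(q)

    statements = []
    for source, group in by_source.items():
        # A single FROM clause means the table is scanned once for the group.
        lines = ["FROM {}".format(source)]
        for q in group:
            lines.append("INSERT OVERWRITE TABLE {}".format(q["target"]))
            select = "  SELECT {}".format(", ".join(q["columns"]))
            if q.get("where"):
                select += " WHERE {}".format(q["where"])
            lines.append(select)
        statements.append("\n".join(lines) + ";")
    return statements


if __name__ == "__main__":
    batch = [
        {"source": "lineitem", "target": "out_a",
         "columns": ["l_orderkey"], "where": "l_quantity > 30"},
        {"source": "lineitem", "target": "out_b",
         "columns": ["l_partkey", "l_tax"]},
        {"source": "orders", "target": "out_c",
         "columns": ["o_orderkey"]},
    ]
    for stmt in merge_shared_scans(batch):
        print(stmt)
        print()
```

Here the two `lineitem` queries collapse into one statement with a single scan, while the `orders` query stays on its own. Sharing join computation, which the abstract also mentions, would need more analysis than this grouping step provides.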
Pages: 1-11
Page count: 10