Adaptive correlation exploitation in big data query optimization

被引:0
作者
Yuchen Liu
Hai Liu
Dongqing Xiao
Mohamed Y. Eltabakh
机构
[1] Worcester Polytechnic Institute (WPI),Computer Science Department
来源
The VLDB Journal | 2018年 / 27卷
关键词
Data correlations; Big data; Query optimization; Soft and hard correlations; Incremental maintenance;
D O I
暂无
中图分类号
学科分类号
摘要
Correlations among the data attributes are abundant and inherent in most application domains. These correlations, if managed in systematic and efficient ways, would enable various optimization opportunities. Unfortunately, the state-of-art techniques are all heavily tailored toward optimizing factors intrinsic to relational databases, e.g., predicate selectivity, random I/O accesses, and secondary indexes, which are mostly not applicable to the modern big data infrastructures, e.g., Hadoop and Spark. In this paper, we propose the EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} system for exploiting the data’s correlations in big data query optimization. EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} supports two types of correlations; hard (which does not allow for exceptions) and soft (which allows for exceptions). We introduce a three-phase approach for managing soft correlations including: (1) validating and judging the worthiness of soft correlations, (2) selecting and preparing the soft correlations for deployment, and (3) exploiting the correlations in query optimization. EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} introduces a novel cost-benefit model for adaptively selecting the most beneficial soft correlations given a query workload. We show the complexity of this problem (NP-Hard) and propose a heuristic to efficiently solve it in a polynomial time. Moreover, we present incremental maintenance algorithms for efficiently updating the system’s state under data appends and workload changes. EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} prototype is implemented as an extension to the Hive engine on top of Hadoop. The experimental evaluation shows the potential of EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} in achieving more than 10x speedup while introducing minimal storage overheads.
引用
收藏
页码:873 / 898
页数:25
相关论文
共 57 条
[1]  
Bu Y(2010)Haloop: efficient iterative data processing on large clusters Proc. VLDB Endow. 3 285-296
[2]  
Howe B(2010)Cheetah: a high performance, custom data warehouse on top of mapreduce Proc. VLDB Endow. 3 1459-1468
[3]  
Balazinska M(2013)Discovering denial constraints PVLDB 6 1498-1509
[4]  
Ernst MD(2010)Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing) VLDB 3 518-529
[5]  
Chen S(2012)Restore: reusing results of mapreduce jobs Proc. VLDB Endow. 5 586-597
[6]  
Chu X(2011)Cohadoop: flexible data placement and its exploitation in hadoop PVLDB 4 575-585
[7]  
Ilyas IF(2008)Conditional functional dependencies for capturing data inconsistencies ACM Trans. Database Syst. 33 6:1-6:48
[8]  
Papotti P(1999)TANE: an efficient algorithm for discovering functional and approximate dependencies Comput. J. 42 100-111
[9]  
Dittrich J(1975)Fast approximation algorithms for the knapsack and sum of subset problems J. ACM 22 463-468
[10]  
Quiané-Ruiz J-A(2010)The performance of mapreduce: an in-depth study Proc. VLDB Endow. 3 472-483