Adaptive correlation exploitation in big data query optimization

被引：0

作者：

Yuchen Liu

Hai Liu

Dongqing Xiao

Mohamed Y. Eltabakh

机构：

[1] Worcester Polytechnic Institute (WPI),Computer Science Department

来源：

The VLDB Journal | 2018年 / 27卷

关键词：

Data correlations; Big data; Query optimization; Soft and hard correlations; Incremental maintenance;

D O I：

暂无

中图分类号：

学科分类号：

摘要：

Correlations among the data attributes are abundant and inherent in most application domains. These correlations, if managed in systematic and efficient ways, would enable various optimization opportunities. Unfortunately, the state-of-art techniques are all heavily tailored toward optimizing factors intrinsic to relational databases, e.g., predicate selectivity, random I/O accesses, and secondary indexes, which are mostly not applicable to the modern big data infrastructures, e.g., Hadoop and Spark. In this paper, we propose the EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} system for exploiting the data’s correlations in big data query optimization. EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} supports two types of correlations; hard (which does not allow for exceptions) and soft (which allows for exceptions). We introduce a three-phase approach for managing soft correlations including: (1) validating and judging the worthiness of soft correlations, (2) selecting and preparing the soft correlations for deployment, and (3) exploiting the correlations in query optimization. EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} introduces a novel cost-benefit model for adaptively selecting the most beneficial soft correlations given a query workload. We show the complexity of this problem (NP-Hard) and propose a heuristic to efficiently solve it in a polynomial time. Moreover, we present incremental maintenance algorithms for efficiently updating the system’s state under data appends and workload changes. EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} prototype is implemented as an extension to the Hive engine on top of Hadoop. The experimental evaluation shows the potential of EXORD+\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^+$$\end{document} in achieving more than 10x speedup while introducing minimal storage overheads.

引用

页码：873 / 898

页数：25

共 57 条

[1]

Bu Y(2010)Haloop: efficient iterative data processing on large clusters Proc. VLDB Endow. 3 285-296

[2]

Howe B(2010)Cheetah: a high performance, custom data warehouse on top of mapreduce Proc. VLDB Endow. 3 1459-1468

[3]

Balazinska M(2013)Discovering denial constraints PVLDB 6 1498-1509

[4]

Ernst MD(2010)Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing) VLDB 3 518-529

[5]

Chen S(2012)Restore: reusing results of mapreduce jobs Proc. VLDB Endow. 5 586-597

[6]

Chu X(2011)Cohadoop: flexible data placement and its exploitation in hadoop PVLDB 4 575-585

[7]

Ilyas IF(2008)Conditional functional dependencies for capturing data inconsistencies ACM Trans. Database Syst. 33 6:1-6:48

[8]

Papotti P(1999)TANE: an efficient algorithm for discovering functional and approximate dependencies Comput. J. 42 100-111

[9]

Dittrich J(1975)Fast approximation algorithms for the knapsack and sum of subset problems J. ACM 22 463-468

[10]

Quiané-Ruiz J-A(2010)The performance of mapreduce: an in-depth study Proc. VLDB Endow. 3 472-483

← 1 2 3 4 5 6 →