Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries

被引:58
作者
Hasan, Shohedul [1 ]
Thirumuruganathan, Saravanan [2 ]
Augustine, Jees [1 ]
Koudas, Nick [3 ]
Das, Gautam [1 ]
机构
[1] UT Arlington, Arlington, TX 76019 USA
[2] HBKU, QCRI, Doha, Qatar
[3] Univ Toronto, Toronto, ON, Canada
来源
SIGMOD'20: PROCEEDINGS OF THE 2020 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2020年
基金
美国国家科学基金会;
关键词
selectivity estimation; cardinality estimation; deep learning; density estimation; neural autoregressive models; MADE;
D O I
10.1145/3318464.3389741
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Selectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality estimates could result in the selection of bad plans by the query optimizer. Recently, deep learning has been applied to this problem with promising results. However, many of the proposed approaches often struggle to provide accurate results for multi attribute queries involving large number of predicates and with low selectivity. In this paper, we propose two complementary approaches that are effective for this scenario. Our first approach models selectivity estimation as a density estimation problem where one seeks to estimate the joint probability distribution from a finite number of samples. We leverage techniques from neural density estimation to build an accurate selectivity estimator. The key idea is to decompose the joint distribution into a set of tractable conditional probability distributions such that they satisfy the autoregressive property. Our second approach formulates selectivity estimation as a supervised deep learning problem that predicts the selectivity of a given query. We describe how to extend our algorithms for range queries. We also introduce and address a number of practical challenges arising when adapting deep learning for relational data. These include query/data featurization, incorporating query workload information in a deep learning framework and the dynamic scenario where both data and workload queries could be updated. Our extensive experiments with a special emphasis on queries with a large number of predicates and/or small result sizes demonstrates that our proposed techniques provide fast and accurate selective estimates with minimal space overhead.
引用
收藏
页码:1035 / 1050
页数:16
相关论文
共 59 条
[1]   Learning-based Query Performance Modeling and Prediction [J].
Akdere, Mert ;
Cetintemel, Ugur ;
Riondato, Matteo ;
Upfal, Eli ;
Zdonik, Stanley B. .
2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, :390-401
[2]  
[Anonymous], 1989, Applied Linear Regression Models
[3]  
[Anonymous], 1996, ACM SIGMOD RECORD
[4]  
Bruno N, 2001, SIGMOD RECORD, V30, P211, DOI 10.1145/376284.375686
[5]  
Bruno N., 2004, P ACM SIGMOD INT C M, P311
[6]   Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches [J].
Cormode, Graham ;
Garofalakis, Minos ;
Haas, Peter J. ;
Jermaine, Chris .
FOUNDATIONS AND TRENDS IN DATABASES, 2011, 4 (1-3) :1-294
[7]  
Das G., 2020, ICDE
[8]  
Doan A., 2020, P EDBT
[9]  
Dougherty J., 1995, Machine Learning. Proceedings of the Twelfth International Conference on Machine Learning, P194
[10]   Selectivity Estimation for Range Predicates using Lightweight Models [J].
Dutt, Anshuman ;
Wang, Chi ;
Nazi, Azade ;
Kandula, Srikanth ;
Narasayya, Vivek ;
Chaudhuri, Surajit .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (09) :1044-1057