Estimating the selectivity of LIKE queries using pattern-based histograms

被引:2
作者
Aytimur, Mehmet [1 ]
Cakmak, Ali [1 ]
机构
[1] Istanbul Sehir Univ, Fac Engn, Dept Comp Sci, Istanbul, Turkey
关键词
Selectivity estimation; histograms; data management; sequence mining; EFFICIENT ALGORITHM;
D O I
10.3906/elk-1806-96
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Accurate cost and time estimation of a query is one of the major success indicators for database management systems. SQL allows the expression of flexible queries on text-formatted data. The LIKE operator is used to search for a specified pattern (e.g., LIKE "luck%") in a string database. It is vital to estimate the selectivity of such flexible predicates for the query optimizer to choose an efficient execution plan. In this paper, we study the problem of estimating the selectivity of a LIKE query predicate over a bag of strings. We propose a new type of pattern-based histogram structure to summarize the data distribution in a particular column. More specifically, we first mine sequential patterns over a given string database and then construct a special histogram out of the mined patterns. During query optimization time, pattern-based histograms are exploited to estimate the selectivity of a LIKE predicate. The experimental results on a real dataset from DBLP show that the proposed technique outperforms the state of the art for generic LIKE queries like %s(1)%s(2)%...%s(n) % where s(i) represents one or more characters. What is more, the proposed histogram structure requires more than two orders of magnitude smaller memory space, and the estimation time is almost an order of magnitude less in comparison to the state of the art.
引用
收藏
页码:3319 / 3334
页数:16
相关论文
共 25 条
[1]  
AGRAWAL R, 1995, PROC INT CONF DATA, P3, DOI 10.1109/ICDE.1995.380415
[2]  
[Anonymous], 2003, Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, DOI [10.1145/872757.872796, DOI 10.1145/872757.872796]
[3]  
[Anonymous], 2006, ICDE, DOI [DOI 10.1109/ICDE.2006.9, 10.1109/ICDE.2006.9]
[4]  
[Anonymous], 1996, ACM SIGMOD RECORD
[5]  
Arasu A., 2006, Proceedings of VLDB Endowment, P918
[6]   Selectivity estimation for string predicates: Overcoming the underestimation problem [J].
Chaudhuri, S ;
Ganti, V ;
Gravano, L .
20TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2004, :227-238
[7]   The string B-tree: A new data structure for string search in external memory and its applications [J].
Ferragina, P ;
Grossi, R .
JOURNAL OF THE ACM, 1999, 46 (02) :236-280
[8]  
Jagadish H. V., 1999, Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, P249, DOI 10.1145/303976.304001
[9]   One-dimensional and multi-dimensional substring selectivity estimation [J].
Jagadish, HV ;
Kapitskaia, O ;
Ng, RT ;
Srivastava, D .
VLDB JOURNAL, 2000, 9 (03) :214-230
[10]  
Jagadish HV, 1998, VLDB, P24