Efficient processing of substring match queries with inverted variable-length gram indexes

被引:3
作者
Kim, Younghoon [1 ]
Park, Hyoungmin [2 ]
Shim, Kyuseok [1 ]
Woo, Kyoung-Gu [2 ]
机构
[1] Seoul Natl Univ, Sch EE CS, Seoul 151600, South Korea
[2] Samsung Elect, SAIT, Kiheung, South Korea
基金
新加坡国家研究基金会;
关键词
Database management; Query processing; Substring matching; q-Gram index; Variable-length gram; SEARCH; SELECTIVITY;
D O I
10.1016/j.ins.2013.04.037
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
With the widespread use of the internet, text-based data sources have become ubiquitous and the demand for effective support of string matching queries continues to increase. The relational query language SQL supports LIKE clauses for string data to handle substring matching queries. Due to the popularity of such substring matching queries, there have been many studies on designing efficient indexes to support the LIKE clause in SQL Among them, the indexes based on q-grams with fixed or variable sizes have been studied extensively. In this paper, we show that the optimal processing of substring matching queries with inverted variable-length gram indexes should be decided judiciously. However, the search space of finding the optimal variable-length gram set among those available is exponential with the number of grams. To reduce the search space, we present effective pruning strategies which do not sacrifice optimality. Based on the cost estimation, we propose the optimal algorithms to find the best plan with the minimum cost for substring matching queries using these pruning strategies. We also provide the approximate algorithms to overcome the exponential nature of search space to find an optimal query plan. Our performance study confirms that the proposed techniques improve query execution time for substring matching significantly compared to the running time of the traditional algorithms. (C) 2013 Elsevier Inc. All rights reserved.
引用
收藏
页码:119 / 141
页数:23
相关论文
共 50 条
[1]  
Adams E., 1993, CSC, P433, DOI [10.1145/170791.170891, DOI 10.1145/170791.170891]
[2]   An improvement of the Aho-Corasick machine [J].
Ando, K ;
Kinoshita, T ;
Shishibori, M ;
Aoe, J .
INFORMATION SCIENCES, 1998, 111 (1-4) :139-151
[3]  
[Anonymous], 1973, SORTING SEARCHING
[4]  
[Anonymous], 2004, SIGMOD
[5]  
[Anonymous], 2003, Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, DOI [DOI 10.1145/872757.872796, 10.1145/872757.872796]
[6]  
Arasu A., 2006, Proceedings of VLDB Endowment, P918
[7]  
Baeza-Yates R, 1999, MODERN INFORM RETRIE, V463
[8]  
Bayardo R.J., 2007, WWW INT C WORLD WID, P131, DOI [DOI 10.1145/1242572.1242591, 10.1145/1242572.1242591]
[9]   The anatomy of a large-scale hypertextual Web search engine [J].
Brin, S ;
Page, L .
COMPUTER NETWORKS AND ISDN SYSTEMS, 1998, 30 (1-7) :107-117
[10]   On the resemblance and containment of documents [J].
Broder, AZ .
COMPRESSION AND COMPLEXITY OF SEQUENCES 1997 - PROCEEDINGS, 1998, :21-29