Processing string similarity search in external memory efficiently

被引:0
作者
Wang, Jinbao [1 ]
Gao, Hong [2 ]
Li, Jianzhong [2 ]
Yang, Donghua [1 ]
机构
[1] Center for High Performance Computing, Academy of Fundamental and Interdisciplinary Sciences, Harbin Institute of Technology, Harbin
[2] School of Computer Science and Technology, Harbin Institute of Technology, Harbin
来源
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2015年 / 52卷 / 03期
关键词
Edit distance; External memory; Query processing; Similarity search; String;
D O I
10.7544/issn1000-1239.2015.20130683
中图分类号
学科分类号
摘要
String similarity search is fundamental to various applications, such as data cleaning, spell checking, bioinformatics and information integration, since users tend to provide fuzzy queries in these applications due to input errors of both queried strings and data strings. With the rapid growth of data size, big string datasets become ubiquitous, and almost every modern information system stores data presented in string format. String similarity search should be well supported in modern information systems. Memory based q-gram inverted indexes fail to support string similarity search over big string datasets due to the memory constraint, and it can no longer work if the data size grows larger than memory size. Existing external memory method, Behm-Index, only supports length-filter and prefix filter. This paper proposes LPA-Index to reduce I/O cost for better query response time. It is a disk resident index which suffers no limitation on data size compared to memory size. LPA-Index supports both length-filter and positional filter which reduce query candidates efficiently, and it selectively reads inverted lists during query processing for better I/O performance. Experiment results on both real datasets and a synthetic dataset demonstrate the efficiency of LPA-Index and its advantages over existing state of art disk index Behm-Index with regard to I/O cost and query response time. ©, 2015, Jisuanji Yanjiu yu Fazhan/Computer Research and Development. All right reserved.
引用
收藏
页码:738 / 748
页数:10
相关论文
共 13 条
[1]  
Behm A., Li C., Et al., Answering approximate string queries on large data sets using external memory, Proc of IEEE ICDE'11, pp. 888-899, (2011)
[2]  
Wagner R., Fischer M., The string-to-string correction problem, Journal of the ACM, 21, 1, pp. 168-173, (1974)
[3]  
Zobel J., Moffat A., Inverted files for text search engines, ACM Computer Survey, 38, 2, pp. 6-20, (2006)
[4]  
Gravano L., Ipeirotis P., Et al., Approximate string join in a database (almost) for free, Proc of VLDB'01, pp. 491-500, (2001)
[5]  
Chaudhuri S., Ganti V., Et al., A primitive operator for similarity joins in data cleaning, Proc of IEEE ICDE'06, pp. 5-15, (2006)
[6]  
Xiao C., Wang W., Et al., Ed-join: An efficient algorithm for similarity joins with edit distance constrains, Proceedings of the VLDB Endowment, 1, 1, pp. 933-944, (2008)
[7]  
Li C., Lu J., Et al., Efficient merging and filtering algorithms for approximate string searches, Proc of IEEE ICDE'08, pp. 257-266, (2008)
[8]  
Sarawagi S., Kirpal A., Efficient set joins on similarity predicates, Proc of ACM SIGMOD'04, pp. 743-754, (2004)
[9]  
Behm A., Ji S., Et al., Space-constrained gram-based indexing for efficient approximate string search, Proc of IEEE ICDE'09, pp. 204-215, (2009)
[10]  
Hadijieleftheriou M., Koudas N., Et al., Increamental maintenance of length normalized indexes for approximate string matching, Proc of ACM SIGMOD'10, pp. 429-440, (2011)