minIL: A Simple and Small Index for String Similarity Search with Edit Distance

被引:2
作者
Yang, Zhong [1 ]
Zheng, Bolong [1 ]
Wang, Xianzhi [2 ]
Li, Guohui [1 ]
Zhou, Xiaofang [3 ]
机构
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Univ Technol Sydney, Sydney, NSW, Australia
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
来源
2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022) | 2022年
关键词
threshold string similarity search; edit distance; invertd index; minhash; EFFICIENT ALGORITHM; JOIN;
D O I
10.1109/ICDE53745.2022.00047
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The string similarity search is core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. We study the problem of threshold similarity search with the edit distance, where given a set of strings, a threshold k, and a query string q, we aim to find all strings in the set whose edit distances to q are no larger than k. Extensive studies have been proposed for the threshold similarity search problem with the edit distance. However, they suffer from a huge space consumption issue when achieving only an acceptable efficiency, especially for long strings. In this paper, we propose a simple yet small index, called minIL, to eliminate this issue. First, we adopt a minhash family to capture pivot characters and to construct sketch representations for strings. Second, we develop a multi-level inverted index to search sketches with a low space consumption. Finally, we apply a novel learned index technique on top of the index that further improves the query efficiency. Extensive experiments on real-world datasets offer insight into the performance of our method and show that it substantially reduces the index size, and is capable of outperforming the baseline approaches.
引用
收藏
页码:565 / 577
页数:13
相关论文
共 28 条
[1]  
[Anonymous], 2004, P ACM SIGMOD INT C M
[2]  
[Anonymous], 2010, P ACM SIGMOD INT C M, DOI DOI 10.1145/1807167.1807266
[3]  
Arasu A., 2006, VLDB'06: Proceedings of the 32nd international conference on Very large data bases, P918
[4]  
Bayardo R.J., 2007, WWW '07, P131, DOI [DOI 10.1145/1242572.1242591, 10.1145/1242572.1242591]
[5]   Min-wise independent permutations [J].
Broder, AZ ;
Charikar, M ;
Frieze, AM ;
Mitzenmacher, M .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2000, 60 (03) :630-659
[6]   Streaming Algorithms for Embedding and Computing Edit Distance in the Low Distance Regime [J].
Chakraborty, Diptarka ;
Goldenberg, Elazar ;
Koucky, Michal .
STOC'16: PROCEEDINGS OF THE 48TH ANNUAL ACM SIGACT SYMPOSIUM ON THEORY OF COMPUTING, 2016, :712-725
[7]  
Chatzigiannakis I., 2018, P 45 INT C AUT LANG, V107, P34, DOI [10.4230/LIPIcs.ICALP.2018.34, DOI 10.4230/LIPICS.ICALP.2018.34]
[8]  
Ciaccia P, 1997, PROCEEDINGS OF THE TWENTY-THIRD INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES, P426
[9]   A Pivotal Prefix Based Filtering Algorithm for String Similarity Search [J].
Deng, Dong ;
Li, Guoliang ;
Feng, Jianhua .
SIGMOD'14: PROCEEDINGS OF THE 2014 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2014, :673-684
[10]  
Deng D, 2013, PROC INT CONF DATA, P925, DOI 10.1109/ICDE.2013.6544886