MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance

被引:8
作者
Zhang, Haoyu [1 ]
Zhang, Qin [1 ]
机构
[1] Indiana Univ Bloomington, Bloomington, IN 47405 USA
来源
KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING | 2020年
关键词
JOINS;
D O I
10.1145/3394486.3403099
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We study a fundamental problem in data analytics: similarity search under edit distance (or, edit similarity search for short). In this problem we try to build an index on a set of n strings S = {s(1), . . . , s(n)}, with the goal of answering the following two types of queries: (1) the threshold query: given a query string t and a threshold K, output all s(i) is an element of S such that the edit distance between si and t is at most K; (2) the top-k query: given a query string t, output the k strings in S that are closest to t in terms of edit distance. Edit similarity search has numerous applications in bioinformatics, databases, data mining, information retrieval, etc., and has been studied extensively in the literature. In this paper we propose a novel algorithm for edit similarity search named MinSearch. The algorithm is randomized, and we can show mathematically that it outputs the correct answer with high probability for both types of queries. We have conducted an extensive set of experiments on MinSearch, and compared it with the best existing algorithms for edit similarity search. Our experiments show that MinSearch has a clear advantage (often in orders of magnitudes) against the best previous algorithms in query time, and MinSearch is always one of the best among all competitors in the indexing time and space usage. Finally, MinSearch achieves perfect accuracy for both types of queries on all datasets that we have tested.
引用
收藏
页码:566 / 576
页数:11
相关论文
共 27 条
[1]  
[Anonymous], 2010, P ACM SIGMOD INT C M, DOI DOI 10.1145/1807167.1807266
[2]  
Arasu A., 2006, VLDB'06: Proceedings of the 32nd international conference on Very large data bases, P918
[3]  
Bayardo R. J., 2007, P 16 INT C WORLD WID, P131, DOI DOI 10.1145/1242572.1242591
[4]  
Chaudhuri S., 2003, P 2003 ACM SIGMOD IN, DOI [10.1 145/872757.872796, 10.1145/872757.872796, DOI 10.1145/872757.872796]
[5]   A Pivotal Prefix Based Filtering Algorithm for String Similarity Search [J].
Deng, Dong ;
Li, Guoliang ;
Feng, Jianhua .
SIGMOD'14: PROCEEDINGS OF THE 2014 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2014, :673-684
[6]  
Deng D, 2013, PROC INT CONF DATA, P925, DOI 10.1109/ICDE.2013.6544886
[7]  
Gionis A, 1999, PROCEEDINGS OF THE TWENTY-FIFTH INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES, P518
[8]  
Gravano L., 2001, Proceedings of the 27th International Conference on Very Large Data Bases, P491
[9]  
Hadjieleftheriou M, 2009, ACM SIGMOD/PODS 2009 CONFERENCE, P429
[10]  
Kahveci T., 2001, Proceedings of the 27th International Conference on Very Large Data Bases, P351