MinJoin++: a fast algorithm for string similarity joins under edit distance

被引:0
作者
Nikolai Karpov
Haoyu Zhang
Qin Zhang
机构
[1] Indiana University,
[2] Meta Inc.,undefined
来源
The VLDB Journal | 2024年 / 33卷
关键词
String similarity joins; Edit distance; Local hash minima;
D O I
暂无
中图分类号
学科分类号
摘要
We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted a lot of attention in the past two decades. However, all previous algorithms either cannot scale to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper, we propose a new algorithm for edit similarity joins using a novel string partition-based approach. We show that, theoretically, our algorithm finds all similar pairs with high probability and runs in linear time (plus a data-dependent verification step). The algorithm can also be easily parallelized. Experiments on real-world datasets show that our algorithm outperforms the state-of-the-art algorithms for edit similarity joins by orders of magnitudes in running time and achieves perfect accuracy on most datasets that we have tested.
引用
收藏
页码:281 / 299
页数:18
相关论文
共 38 条
[1]  
Jiang Y(2014)String similarity joins: an experimental evaluation PVLDB 7 625-636
[2]  
Li G(2011)PASS-JOIN: a partition-based method for similarity joins PVLDB 5 253-264
[3]  
Feng J(2013)The advantages of SMRT sequencing Genome Biol. 14 405-4845
[4]  
Li W(2020)Overlap detection on long, error-prone sequencing reads via smooth q-gram Bioinformatics 36 4838-118
[5]  
Li G(1985)Algorithms for approximate string matching Inf. Control 64 100-76
[6]  
Deng D(2014)State-of-the-art in string similarity search and join SIGMOD Record 43 64-1230
[7]  
Wang J(2010)Trie-join: efficient trie-based string similarity joins with edit-distance constraints PVLDB 3 1219-1929
[8]  
Feng J(2013)Vchunkjoin: an efficient algorithm for edit similarity joins IEEE Trans. Knowl. Data Eng. 25 1916-944
[9]  
Roberts RJ(2008)Ed-join: an efficient algorithm for similarity joins with edit distance constraints PVLDB 1 933-undefined
[10]  
Carneiro MO(undefined)undefined undefined undefined undefined-undefined