Extending the Bag Distance for String Similarity Search

被引:0
作者
Mergen S. [1 ]
机构
[1] Departamento de Linguagens e Sistemas de Computação, Universidade Federal de Santa Maria, Avenida Roraima, Rio Grande do Sul, Santa Maria
关键词
Bag Distance; Edit Distance; Metric spaces; String similarity;
D O I
10.1007/s42979-022-01502-5
中图分类号
学科分类号
摘要
In the context of string similarity search, the Edit Distance is the preferred choice for indexes based on a metric space. However, the high distances between strings lead to indexes with low pruning factors. Besides, computing the distances is time consuming. An alternative is the Bag Distance, whose computational cost is lower. In this paper, we propose an extension of the Bag Distance (The Anagram Distance) that allows non-uniform costs. The extension is more compatible to the Edit Distance and its applications. We also transform the index space into one that uses an Anagram Distance as the metric function, leaving the Edit Distance computation to a validation phase. As we describe, the transformation increases the pruning factor of in-memory indexes, specially when the costs are non-uniform. Experiments report the improvements achieved during search, both in terms of execution time and the number of distance computations. © 2022, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
引用
收藏
相关论文
共 28 条
[1]  
Navarro G., Searching in metric spaces by spatial approximation, VLDB J, 11, 1, pp. 28-46, (2002)
[2]  
Novak D., Batko M., Zezula P., Metric index: an efficient and scalable solution for precise and approximate similarity search, Inf Syst, 36, 4, pp. 721-733, (2011)
[3]  
Chen L., Gao Y., Zheng B., Jensen C.S., Yang H., Yang K., Pivot-based metric indexing, Proc VLDB Endow, 10, 10, pp. 1058-1069, (2017)
[4]  
Levenshtein V.I., Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady, 10, pp. 707-710, (1966)
[5]  
Traina C., Traina A.J., Vieira M.R., Faloutsos C., Et al., The omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient, VLDB J-Int J Very Large Data Bases, 16, 4, pp. 483-505, (2007)
[6]  
Traina C., Traina A., Seeger B., Faloutsos C., Slim-trees: High performance metric trees minimizing overlap between nodes, International Conference on Extending Database Technology. Springer, pp. 51-65, (2000)
[7]  
Deng D., Li G., Wen H., Jagadish H., Feng J., Meta: an efficient matching-based method for error-tolerant autocompletion, Proc VLDB Endow, 9, 10, pp. 828-839, (2016)
[8]  
Zezula P., Amato G., Dohnal V., Batko M., Similarity search: the metric space approach, 32, (2006)
[9]  
Bartolini I., Ciaccia P., Patella M., String matching with metric trees using an approximate distance, Proceedings of the 9Th International Symposium on String Processing and Information Retrieval. SPIRE 2002, pp. 271-283, (2002)
[10]  
Chavez E., Navarro G., A compact space decomposition for effective metric indexing, Pattern Recogn Lett, 26, 9, pp. 1363-1376, (2005)