Boosting the Quality of Approximate String Matching by Synonyms

被引:8
作者
Lu, Jiaheng [1 ,2 ]
Lin, Chunbin [3 ]
Wang, Wei [4 ]
Li, Chen [5 ]
Xiao, Xiaokui [6 ]
机构
[1] Renmin Univ China, Beijing, Peoples R China
[2] Univ Helsinki, Dept Comp Sci, FI-00014 Helsinki, Finland
[3] Univ Calif San Diego, La Jolla, CA 92093 USA
[4] Univ New S Wales, Sch Engn & Comp Sci, Sydney, NSW 2052, Australia
[5] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92697 USA
[6] Nanyang Technol Univ, Sch Engn & Comp Sci, Singapore 639798, Singapore
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2015年 / 40卷 / 03期
关键词
Algorithms; Experimentation; Performance; Theory; String similarity search; similarity join; semantic search; SIMILARITY JOINS; ALGORITHMS;
D O I
10.1145/2818177
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
A string-similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered to be similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, for example, number of common words or q-grams. While this is indeed an indicator of similarity, there are many important cases where syntactically-different strings can represent the same real-world object. For example, "Bill" is a short form of "William," and "Database Management Systems" can be abbreviated as "DBMS." Given a collection of predefined synonyms, the purpose of this article is to explore such existing knowledge to effectively evaluate the similarity between two strings and efficiently perform similarity searches and joins, thereby boosting the quality of approximate string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. We then study efficient algorithms for similarity searches and joins by proposing two novel indexes, called SI-trees and QP-trees, which combine signature-filtering and length-filtering strategies. In order to improve the efficiency of our algorithms, we develop an estimator to estimate the size of candidates to enable an online selection of signature filters. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the experimental results from a comprehensive study of the algorithms with three real datasets verify the effectiveness and efficiency of our approaches.
引用
收藏
页数:42
相关论文
共 36 条
[1]  
Alon N., 1996, Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, P20, DOI 10.1145/237814.237823
[2]  
[Anonymous], 2003, 9 ACM SIGKDD INTCONF, DOI DOI 10.1145/956750.956759
[3]  
[Anonymous], 2013, P ACM SIGMOD INT C M, DOI DOI 10.1145/2463676.2465313
[4]  
[Anonymous], 2003, Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, DOI [DOI 10.1145/872757.872796, 10.1145/872757.872796]
[5]  
[Anonymous], 2012, SIGMOD Conference
[6]  
[Anonymous], 2003, IIWeb
[7]  
Arasu A., 2006, Proceedings of VLDB Endowment, P918
[8]   Transformation-based framework for record matching [J].
Arasu, Arvind ;
Chaudhuri, Surajit ;
Kaushik, Raghav .
2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, :40-49
[9]  
Arasu Arvind., 2009, PVLDB, V2, P514
[10]  
Bar-Yossef Z., 2002, Randomization and Approximation Techniques in Computer Science. 6th International Workshop, RANDOM 2002. Proceedings (Lecture Notes in Computer Science Vol.2483), P1