Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join

被引:0
作者
Wang, Jiannan [1 ]
Li, Guoliang [1 ]
Fe, Jianhua [1 ]
机构
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
来源
IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2011) | 2011年
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
String similarity join that finds similar string pairs between two string sets is an essential operation in many applications, and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs which may not match exactly. In this paper, we propose a new similarity metrics, called "fuzzy token matching based similarity", which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy match between two tokens. We study the problem of similarity join using this new similarity metrics and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
引用
收藏
页码:458 / 469
页数:12
相关论文
共 23 条
[1]  
[Anonymous], 2004, SIGMOD
[2]  
[Anonymous], WWW
[3]  
[Anonymous], 2003, Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, DOI [DOI 10.1145/872757.872796, 10.1145/872757.872796]
[4]  
[Anonymous], IFI200702 U ZUR DEP
[5]  
Arasu A., 2006, Proceedings of VLDB Endowment, P918
[6]   Transformation-based framework for record matching [J].
Arasu, Arvind ;
Chaudhuri, Surajit ;
Kaushik, Raghav .
2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, :40-49
[7]  
Bayardo R.J., 2007, WWW INT C WORLD WID, P131, DOI [DOI 10.1145/1242572.1242591, 10.1145/1242572.1242591]
[8]   A SIMPLE AND FAST LABEL CORRECTING ALGORITHM FOR SHORTEST PATHS [J].
BERTSEKAS, DP .
NETWORKS, 1993, 23 (08) :703-709
[9]  
Chaudhuri S., 2006, ICDE, DOI [10.1109/ICDE.2006.9, DOI 10.1109/ICDE.2006.9]
[10]  
Gravano L., 2001, Proceedings of the 27th International Conference on Very Large Data Bases, P491