FrepJoin: an efficient partition-based algorithm for edit similarity join

被引:3
作者
Luo, Ji-zhou [1 ,2 ]
Shi, Sheng-fei [1 ]
Wang, Hong-zhi [1 ]
Li, Jian-zhong [1 ]
机构
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin 150001, Heilongjiang, Peoples R China
[2] Guangdong Key Lab Popular High Performance Comp, Key Lab Serv Comp & Applicat, Shenzhen 518000, Peoples R China
关键词
String similarity join; Edit distance; Filter and refine; Data partition; Combined frequency vectors;
D O I
10.1631/FITEE.1601347
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
引用
收藏
页码:1499 / 1510
页数:12
相关论文
共 25 条
[1]   Fuzzy Joins Using MapReduce [J].
Afrati, Foto N. ;
Das Sarma, Anish ;
Menestrina, David ;
Parameswaran, Aditya ;
Ullman, Jeffrey D. .
2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, :498-509
[2]  
[Anonymous], 2004, SIGMOD
[3]  
[Anonymous], 2003, Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, DOI [10.1145/872757.872796, DOI 10.1145/872757.872796]
[4]  
[Anonymous], 2009, Proceedings of the 18th international conference on World wide web, WWW '09, DOI 10.1145/1526709.1526760
[5]  
[Anonymous], 2007, WWW INT C WORLD WID
[6]  
Arasu A., 2006, Proceedings of VLDB Endowment, P918
[7]  
Bayardo R.J., 2007, WWW, P131, DOI [DOI 10.1145/1242572.1242591, 10.1145/1242572.1242591]
[8]  
Chaudhuri S., 2006, INT C DAT ENG, P687
[9]  
Chaudhuri S., 2006, IEEE DATA ENG B, V29, P60
[10]  
Dong X.L., 2007, VLDB, P687