FrepJoin: an efficient partition-based algorithm for edit similarity join

被引:3
作者
Luo, Ji-zhou [1 ,2 ]
Shi, Sheng-fei [1 ]
Wang, Hong-zhi [1 ]
Li, Jian-zhong [1 ]
机构
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin 150001, Heilongjiang, Peoples R China
[2] Guangdong Key Lab Popular High Performance Comp, Key Lab Serv Comp & Applicat, Shenzhen 518000, Peoples R China
关键词
String similarity join; Edit distance; Filter and refine; Data partition; Combined frequency vectors;
D O I
10.1631/FITEE.1601347
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
引用
收藏
页码:1499 / 1510
页数:12
相关论文
共 25 条
[21]  
Wang JN, 2011, PROC INT CONF DATA, P458, DOI 10.1109/ICDE.2011.5767865
[22]   VChunkJoin: An Efficient Algorithm for Edit Similarity Joins [J].
Wang, Wei ;
Qin, Jianbin ;
Xiao, Chuan ;
Lin, Xuemin ;
Shen, Heng Tao .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (08) :1916-1929
[23]  
Wang W, 2009, ACM SIGMOD/PODS 2009 CONFERENCE, P759
[24]  
Xiao C., 2008, WWW, P131
[25]   Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints [J].
Xiao, Chuan ;
Wang, Wei ;
Lin, Xuemin .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (01) :933-944