Adaptive Windows for Duplicate Detection

被引:37
作者
Draisbach, Uwe [1 ]
Naumann, Felix [1 ]
Szott, Sascha [2 ]
Wonneberg, Oliver [2 ,3 ]
机构
[1] Hasso Plattner Inst, Potsdam, Germany
[2] Zuse Inst, Berlin, Germany
[3] R Lindner GmbH & Co KG, Berlin, Germany
来源
2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE) | 2012年
关键词
D O I
10.1109/ICDE.2012.20
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity, respectively. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records and (ii) data sets might have a high volume making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data comparing only records that appear within the same window. We propose with the Duplicate Count Strategy (DCS) a variation of SNM that uses a varying window size. It is based on the intuition that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. Next to the basic variant of DCS, we also propose and thoroughly evaluate a variant called DCS++ which is provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
引用
收藏
页码:1073 / 1083
页数:11
相关论文
共 24 条
[1]  
[Anonymous], 2010, P INT WORKSH QUAL DA
[2]  
[Anonymous], 2007, Quality Measures in Data Mining, DOI DOI 10.1007/978-3-540-44918-8_6
[3]  
[Anonymous], 2003, 9 ACM SIGKDD INTCONF, DOI DOI 10.1145/956750.956759
[4]  
[Anonymous], 2010, SYNTHESIS LECT DATA
[5]  
Baxter R., 2003, ACM SIGKDD 03 WORKSH, P25, DOI DOI 10.1007/978-3-319-11257-2
[6]  
Benjelloun O., VLDB J
[7]  
Christen P, 2005, LECT NOTES COMPUT SC, V3578, P109
[8]  
Christen P., 2011, IEEE T KNOWLEDGE DAT, VPP, P1
[9]  
Cu LF, 2004, SIAM PROC S, P477
[10]  
Dong X., 2005, P 2005 ACM SIGMOD IN, P85, DOI DOI 10.1145/1066157.1066168