An incremental clustering scheme for data de-duplication

被引:19
作者
Costa, Gianni [1 ]
Manco, Giuseppe [1 ]
Ortale, Riccardo [1 ]
机构
[1] ICAR CNR, I-87036 Arcavacata Di Rende, CS, Italy
基金
英国科研创新办公室;
关键词
Clustering-mining methods and algorithms; Record classification; Indexing methods and structures; Locality-sensitive hashing; Min-wise independent permutations; Approximated similarity measures; De-duplication;
D O I
10.1007/s10618-009-0155-0
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples, that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index, that maps any tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be efficiently identified by simply retrieving those tuples that appear in the same buckets associated to the query tuple itself, without completely scanning the original database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
引用
收藏
页码:152 / 187
页数:36
相关论文
共 47 条
[1]  
Agichtein Eugene., 2004, P ACM SIGKDD INT C K, P20
[2]  
Ananthakrishna R., 2002, Proceedings of the Twenty-eighth International Conference on Very Large Data Bases, P586
[3]  
[Anonymous], 1990, String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage
[4]  
[Anonymous], 2003, 2003 ACM SIGMOD INT, DOI DOI 10.1145/872757.872796
[5]  
[Anonymous], P ACM SIGMOD INT C M
[6]  
[Anonymous], P 32 INT C VER LARG
[7]  
[Anonymous], P 9 ACM SIGMOD WORKS
[8]  
[Anonymous], 2002, P 8 ACM SIGKDD INT C, DOI DOI 10.1145/775047.775116
[9]  
[Anonymous], 2003, P 9 ACM SIGKDD INT C, DOI DOI 10.1145/956750.956759
[10]  
[Anonymous], PROCEEDINGS OF THE I