SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering

Cited by: 4
Authors
Ribeiro, Leonardo Andrade [1]
Cuzzocrea, Alfredo [2, 3]
Alves Bezerra, Karen Aline [4]
do Nascimento, Ben Hur Bahia [4]
Affiliations
[1] Univ Fed Goias, Inst Informat, Goiania, Go, Brazil
[2] Univ Trieste, Trieste, Italy
[3] ICAR CNR, Trieste, Italy
[4] Univ Fed Lavras, Dept Ciencia Comp, Lavras, Brazil
Source
PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS, VOL 1 (ICEIS) | 2016
Keywords
Data Integration; Data Cleaning; Duplicate Identification; Set Similarity Joins; Clustering; QUERY
DOI
10.5220/0005868700750080
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
A critical task in data cleaning and integration is the identification of duplicate records representing the same real-world entity. A popular approach to duplicate identification employs a similarity join to find pairs of similar records, followed by a clustering algorithm to group together records that refer to the same entity. However, the clustering algorithm is used strictly as a post-processing step, which slows down the overall process and produces results only at the end. In this paper, we propose SjClust, a framework to integrate similarity join and clustering into a single operation. Our approach smoothly accommodates a variety of cluster representation and merging strategies in set similarity join algorithms, while fully leveraging state-of-the-art optimization techniques.
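As a rough illustration of the conventional two-phase pipeline that the abstract contrasts with SjClust (and not of SjClust itself), the sketch below first runs a similarity join and only afterwards clusters the resulting pairs. It rests on several assumptions that do not come from the paper: records are whitespace-tokenized sets, similarity is Jaccard, the join is a naive quadratic scan rather than an optimized set similarity join, and clustering is done by connected components. The function names `jaccard`, `similarity_join`, and `cluster_pairs` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(records, threshold):
    """Naive O(n^2) self-join: index pairs whose Jaccard is >= threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(records[i], records[j]) >= threshold
    ]


def cluster_pairs(n, pairs):
    """Post-processing step: connected components via union-find."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the root for a deterministic result.
            parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy records, tokenized into lowercase word sets (illustrative data only).
records = [
    set(s.lower().split())
    for s in [
        "Univ Fed Goias Inst Informat",
        "Univ Fed Goias Informat",
        "Univ Trieste Italy",
        "Univ Trieste",
    ]
]

pairs = similarity_join(records, 0.6)          # phase 1: join
clusters = cluster_pairs(len(records), pairs)  # phase 2: clustering
print(pairs)     # [(0, 1), (2, 3)]
print(clusters)  # [[0, 1], [2, 3]]
```

In this sketch no cluster is available until the join has finished, which is the limitation the abstract points to; an integrated approach would instead update clusters while the join is still running.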
Pages: 75-80
Page count: 6