Global Detection of Complex Copying Relationships Between Sources

Cited by: 49
Authors
Dong, Xin Luna [1]
Berti-Equille, Laure [2]
Hu, Yifan [1]
Srivastava, Divesh [1]
Affiliations
[1] AT&T Labs Res, Seattle, WA 98195 USA
[2] Univ Rennes 1, Rennes, France
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2010 / Vol. 3 / No. 1
DOI
10.14778/1920841.1921008
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Web technologies have enabled data sharing between sources but also simplified copying (and often publishing without proper attribution). The copying relationships can be complex: some sources copy from multiple sources on different subsets of data; some co-copy from the same source, and some transitively copy from another. Understanding such copying relationships is desirable both for business purposes and for improving many key components in data integration, such as resolving conflicts across various sources, reconciling distinct references to the same real-world entity, and efficiently answering queries over multiple sources. Recent works have studied how to detect copying between a pair of sources, but the techniques can fall short in the presence of complex copying relationships. In this paper we describe techniques that discover global copying relationships between a set of structured sources. Towards this goal we make two contributions. First, we propose a global detection algorithm that identifies co-copying and transitive copying, returning only source pairs with direct copying. Second, global detection requires accurate decisions on copying direction; we significantly improve over previous techniques on this by considering various types of evidence for copying and correlation of copying on different data items. Experimental results on real-world data and synthetic data show high effectiveness and efficiency of our techniques.
Pages: 1358-1369
Page count: 12