Dynamic constraints for record matching

被引:57
作者
Fan, Wenfei [1 ]
Gao, Hong [1 ]
Jia, Xibei [2 ]
Li, Jianzhong [1 ]
Ma, Shuai [2 ]
机构
[1] Harbin Inst Technol, Harbin 150006, Heilongjiang, Peoples R China
[2] Univ Edinburgh, Sch Informat, Edinburgh, Midlothian, Scotland
基金
英国工程与自然科学研究理事会;
关键词
Deduction; Inference; Matching dependencies; Record matching; Data cleaning; DEPENDENCIES; LINKAGE;
D O I
10.1007/s00778-010-0206-6
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (mds) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, mds are developed for record matching, and are defined in terms of similarity predicates and a dynamic semantics. (b) We identify a special case of mds, referred to as relative candidate keys (rcks), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring mds, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. Moreover, we develop a sound and complete system for inferring mds. (d) We provide a quadratic-time algorithm for inferring mds and an effective algorithm for deducing a set of high-quality rcks from mds. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing and in addition, that the md-based techniques effectively improve the quality and efficiency of various record matching methods.
引用
收藏
页码:495 / 520
页数:26
相关论文
共 46 条
[1]  
Abiteboul S., 1995, Foundations of databases, V8
[2]  
Ananthakrishna R., 2002, VLDB
[3]  
[Anonymous], SIGMOD
[4]  
[Anonymous], ICDE
[5]  
[Anonymous], Soundex
[6]  
[Anonymous], 2006, Data Quality: Concepts, Methodologies and Techniques, DOI [DOI 10.1007/3-540-33173-5_1, DOI 10.1007/3-540-33173-5]
[7]  
[Anonymous], 2007, SIGMOD
[8]  
[Anonymous], 1983, The Theory of Relational Database
[9]  
ARASU A, 2008, ICDE
[10]  
Arasu Arvind., 2009, ICDE