Pay-As-You-Go Entity Resolution

Cited by: 80
Authors
Whang, Steven Euijong [1]
Marmaros, David [1]
Garcia-Molina, Hector [2]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Keywords
Entity resolution; pay-as-you-go; data cleaning
DOI
10.1109/TKDE.2012.43
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Entity resolution (ER) is the problem of identifying which records in a database refer to the same entity. In practice, many applications need to resolve large data sets efficiently, but do not require the ER result to be exact. For example, people data from the web may simply be too large to completely resolve with a reasonable amount of work. As another example, real-time applications may not be able to tolerate any ER processing that takes longer than a certain amount of time. This paper investigates how we can maximize the progress of ER with a limited amount of work using "hints," which give information on records that are likely to refer to the same real-world entity. A hint can be represented in various formats (e.g., a grouping of records based on their likelihood of matching), and ER can use this information as a guideline for which records to compare first. We introduce a family of techniques for constructing hints efficiently and techniques for using the hints to maximize the number of matching records identified using a limited amount of work. Using real data sets, we illustrate the potential gains of our pay-as-you-go approach compared to running ER without using hints.
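The idea of using a hint to decide which records to compare first can be illustrated with a small sketch. This is not the paper's algorithm: it assumes the hint is a list of record pairs sorted by a cheap token-overlap score, that the "limited amount of work" is a budget on the number of expensive comparisons, and that matches are transitive. The names `cheap_score`, `build_hint`, and `resolve_with_budget` are hypothetical, and scoring every pair as done here would be too costly for large data sets.

```python
from itertools import combinations


def cheap_score(a, b):
    """Assumed cheap estimate of match likelihood: Jaccard overlap of tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_hint(records):
    """Hypothetical hint: index pairs ordered from most to least likely match."""
    pairs = combinations(range(len(records)), 2)
    return sorted(
        pairs,
        key=lambda p: cheap_score(records[p[0]], records[p[1]]),
        reverse=True,
    )


def resolve_with_budget(records, is_match, budget):
    """Run at most `budget` expensive comparisons, following the hint order."""
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    used = 0
    for i, j in build_hint(records):
        if used >= budget:
            break
        if find(i) == find(j):
            continue  # already clustered together (assumes transitive matches)
        used += 1
        if is_match(records[i], records[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return sorted(clusters.values())
```

Calling `resolve_with_budget(records, is_match, budget)` with a small budget spends the expensive `is_match` calls on the highest-scoring pairs first, so a partial run is more likely to have found the true matches than a run that compares pairs in arbitrary order.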
Pages: 1111-1124
Page count: 14