ProgressER: Adaptive Progressive Approach to Relational Entity Resolution

Cited by: 7
Authors
Altowim, Yasser [1, 2]
Kalashnikov, Dmitri V. [3, 5]
Mehrotra, Sharad [4]
Affiliations
[1] King Abdulaziz City Sci & Technol, POB 6086, Riyadh 11442, Riyadh, Saudi Arabia
[2] Univ Calif Irvine, Irvine, CA 92697 USA
[3] Univ Calif Irvine, AT&T Labs Res, Irvine, CA USA
[4] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92697 USA
[5] AT&T Labs Res, 1 AT&T Way, Bedminster, NJ 07921 USA
Keywords
Data cleaning; progressive computation; entity resolution; relational entity resolution; collective entity resolution; resolution plan; resolution workflow
DOI
10.1145/3154410
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Entity resolution (ER) is the process of identifying which entities in a dataset refer to the same real-world object. In relational ER, the dataset consists of multiple entity-sets and relationships among them. Such relationships cause the resolution of some entities to influence the resolution of other entities. For instance, consider a relational dataset that consists of a set of research paper entities and a set of venue entities. In such a dataset, deciding that two research papers are the same may imply that their venues are also the same. This article proposes a progressive approach to relational ER, named ProgressER, that aims to produce the highest-quality result given a user-specified constraint on the resolution budget. Such a progressive approach is useful for many emerging analytical applications that require low-latency responses (and thus cannot tolerate delays caused by cleaning the entire dataset) and/or in situations where the underlying resources are constrained or costly to use. To maximize the quality of the result, ProgressER follows an adaptive strategy that periodically monitors and reassesses the resolution progress to determine which parts of the dataset should be resolved next and how they should be resolved. More specifically, ProgressER divides the input budget into several resolution windows and analyzes the resolution progress at the beginning of each window to generate a resolution plan for the current window. A resolution plan specifies which blocks of entities and which entity pairs within blocks need to be resolved during the plan execution phase of that window. In addition, ProgressER specifies, for each identified pair of entities, the order in which the similarity functions should be applied to the pair. Such an order plays a significant role in reducing the overall cost because applying the first few functions in this order might be sufficient to resolve the pair. The empirical evaluation of ProgressER demonstrates its significant advantage in terms of progressiveness over traditional ER techniques for the given problem settings.
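The windowed strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`resolve_pair`, `progressive_resolve`, `exact_match`, `token_overlap`), the cost units, and the shared-token heuristic used for planning are hypothetical stand-ins, and blocks and relationships between entity-sets are omitted. The sketch only shows two ideas from the abstract: the budget is spent in windows with a re-planning step at the start of each one, and similarity functions are applied to a pair in a fixed order so that a cheap function can resolve the pair before a costly one runs.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]
# A similarity function returns True (match), False (non-match) or None (undecided).
SimFn = Callable[[str, str], Optional[bool]]
CostedFn = Tuple[int, SimFn]  # (assumed cost in abstract budget units, function)


def exact_match(a: str, b: str) -> Optional[bool]:
    """Cheap test: decisive only when the strings are identical."""
    return True if a == b else None


def token_overlap(a: str, b: str) -> Optional[bool]:
    """Costlier test: Jaccard similarity of lowercase tokens, threshold 0.5 (assumed)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return (len(ta & tb) / len(union) if union else 0.0) >= 0.5


def resolve_pair(a: str, b: str, ordered_fns: Sequence[CostedFn]) -> Tuple[bool, int]:
    """Apply functions in order and stop at the first decisive verdict."""
    spent = 0
    for cost, fn in ordered_fns:
        spent += cost
        verdict = fn(a, b)
        if verdict is not None:
            return verdict, spent
    return False, spent  # still undecided: treated as a non-match in this sketch


def shared_tokens(pair: Pair) -> int:
    return len(set(pair[0].lower().split()) & set(pair[1].lower().split()))


def progressive_resolve(pairs: Sequence[Pair], ordered_fns: Sequence[CostedFn],
                        budget: int, window: int) -> Tuple[List[Pair], int]:
    """Spend the budget window by window, re-planning before each window."""
    matches: List[Pair] = []
    pending = list(pairs)
    spent = 0
    while pending and spent < budget:
        window_limit = min(spent + window, budget)
        # Plan generation (toy heuristic, an assumption): most promising pairs first.
        pending.sort(key=shared_tokens, reverse=True)
        # Plan execution: the last pair of a window may overshoot the limit slightly.
        while pending and spent < window_limit:
            a, b = pending.pop(0)
            verdict, cost = resolve_pair(a, b, ordered_fns)
            spent += cost
            if verdict:
                matches.append((a, b))
    return matches, spent


if __name__ == "__main__":
    fns = [(1, exact_match), (5, token_overlap)]
    venue_pairs = [("VLDB Journal", "VLDB Journal"),
                   ("ACM TKDD", "ACM Trans TKDD"),
                   ("ICDE 2017", "SIGMOD 2003")]
    print(progressive_resolve(venue_pairs, fns, budget=100, window=3))
```

In this sketch the identical pair is settled by the cheap function alone, which is the cost saving the abstract attributes to ordering the similarity functions; a real planner would estimate benefit and cost per block and per pair rather than count shared tokens.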
Pages: 45