A partial-order-based framework for cost-effective crowdsourced entity resolution

被引:29
作者
Chai, Chengliang [1 ]
Li, Guoliang [1 ]
Li, Jian [1 ]
Deng, Dong [1 ]
Feng, Jianhua [1 ]
机构
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
关键词
Crowdsourced entity resolution; Partial order; Quality; Cost; Latency;
D O I
10.1007/s00778-018-0509-6
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Crowdsourced entity resolution has recently attracted significant attentions because it can harness the wisdom of crowd to improve the quality of entity resolution. However, existing techniques either cannot achieve high quality or incur huge monetary costs. To address these problems, we propose a cost-effective crowdsourced entity resolution framework, which significantly reduces the monetary cost while keeping high quality. We first define a partial order on the pairs of records. Then, we select a pair as a question and ask the crowd to check whether the records in the pair refer to the same entity. After getting the answer of this pair, we infer the answers of other pairs based on the partial order. Next, we iteratively select pairs without answers to ask until we get the answers of all pairs. We devise effective algorithms to judiciously select the pairs to ask in order to minimize the number of asked pairs. To further reduce the cost, we propose a grouping technique to group the pairs and we only ask one pair instead of all pairs in each group. We develop error-tolerant techniques to tolerate the errors introduced by the partial order and the crowd. We also study the budget-aware entity resolution, which, given a budget, finds the maximum number of matching pairs within the budget, and propose effective optimization techniques. Experimental results show that our method reduces the cost to 1.25% of existing approaches (or existing approaches take 80x monetary cost of our method) while not sacrificing the quality.
引用
收藏
页码:745 / 770
页数:26
相关论文
共 58 条
[1]  
[Anonymous], 2009, Advances in Neural Information Processing Systems
[2]  
[Anonymous], 2003, 9 ACM SIGKDD INTCONF, DOI DOI 10.1145/956750.956759
[3]  
[Anonymous], 1956, Proceedings of the American Mathematical Society, DOI DOI 10.2307/2033375
[4]  
[Anonymous], 2011, P 2011 ACM SIGMOD IN
[5]  
[Anonymous], 2015, PROC 18 INT C EXTEND
[6]  
[Anonymous], 2010, Proceedings of the ACM SIGKDD Workshop on Human Computation, noeth, DOI [10.1145/1837885.1837906, DOI 10.1145/1837885.1837906]
[7]  
[Anonymous], 2012, AAAI
[8]  
[Anonymous], 2015, P 18 INT C EXT DAT T
[9]  
Aydin BI, 2014, AAAI CONF ARTIF INTE, P2946
[10]  
Bennett P. N., 2013, P 6 ACM INT C WEB SE, P193, DOI [DOI 10.1145/2433396.2433420, DOI 10.1145/2433396.2433420.URL]