A Confidence-Aware Approach for Truth Discovery on Long-Tail Data

Cited by: 249
Authors
Li, Qi [1]
Li, Yaliang [1]
Gao, Jing [1]
Su, Lu [1]
Zhao, Bo [2]
Demirbas, Murat [1]
Fan, Wei [3]
Han, Jiawei [4]
Affiliations
[1] SUNY Buffalo, Buffalo, NY 14260 USA
[2] Microsoft Res, Mountain View, CA USA
[3] Huawei Noahs Ark Lab, Hong Kong, Hong Kong, Peoples R China
[4] Univ Illinois, Urbana, IL 61801 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2014 / Vol. 8 / No. 04
Funding
National Science Foundation (USA);
Keywords
DOI
10.14778/2735496.2735505
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In many real-world applications, the same item may be described by multiple sources. As a consequence, conflicts among these sources are inevitable, which leads to an important task: identifying which piece of information is trustworthy, i.e., the truth discovery task. Intuitively, a piece of information from a reliable source is more trustworthy, and a source that provides trustworthy information is more reliable. Based on this principle, truth discovery approaches have been proposed to infer source reliability degrees and the most trustworthy information (i.e., the truth) simultaneously. However, existing approaches overlook the ubiquitous long-tail phenomenon in these tasks: most sources provide only a few claims, and only a few sources make plenty of claims, so the reliability estimates for small sources rest on too little evidence to be reasonable. To tackle this challenge, we propose a confidence-aware truth discovery (CATD) method to automatically detect truths from conflicting data that exhibit the long-tail phenomenon. The proposed method not only estimates source reliability but also considers the confidence interval of the estimation, so that it can effectively reflect real source reliability for sources with various levels of participation. Experiments on four real-world tasks as well as simulated multi-source long-tail datasets demonstrate that the proposed method outperforms existing state-of-the-art truth discovery approaches by successfully discounting the effect of small sources.
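The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes continuous-valued claims, models each source's error as zero-mean Gaussian, and weights a source by the reciprocal of the upper bound of a chi-squared confidence interval on its error variance, so that sources with few claims receive small weights. The names `chi2_cdf`, `chi2_quantile`, `source_weight`, `discover_truths`, the sample data, and the defaults `alpha=0.05` and `iters=10` are all hypothetical choices made for this illustration.

```python
import math


def chi2_cdf(x, k):
    """Chi-squared CDF with k degrees of freedom, via the standard
    series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, half = k / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-16 * total:
        term *= half / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(a * math.log(half) - half - math.lgamma(a)))


def chi2_quantile(p, k):
    """Lower-tail chi-squared quantile, found by bisection on the CDF."""
    lo, hi = 0.0, k + 20.0 * math.sqrt(k) + 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def source_weight(sq_err, n_claims, alpha=0.05, eps=1e-12):
    """Reciprocal of the upper confidence bound on the error variance.
    Assumption of this sketch: upper bound = sq_err / chi2_quantile(alpha/2, n)."""
    return chi2_quantile(alpha / 2.0, n_claims) / (sq_err + eps)


def discover_truths(claims, alpha=0.05, iters=10):
    """claims maps source -> {item: value}. Alternates between updating
    source weights and recomputing truths as weighted means."""
    items = sorted({i for vals in claims.values() for i in vals})
    # Start from the unweighted mean of all claims on each item.
    truths = {}
    for i in items:
        vals_i = [vals[i] for vals in claims.values() if i in vals]
        truths[i] = sum(vals_i) / len(vals_i)
    weights = {}
    for _ in range(iters):
        for s, vals in claims.items():
            sq_err = sum((v - truths[i]) ** 2 for i, v in vals.items())
            weights[s] = source_weight(sq_err, len(vals), alpha)
        for i in items:
            num = sum(weights[s] * vals[i] for s, vals in claims.items() if i in vals)
            den = sum(weights[s] for s, vals in claims.items() if i in vals)
            truths[i] = num / den
    return truths, weights


# Hypothetical data: two sources with four claims each, one source with a single claim.
claims = {
    "big_good": {"a": 10.0, "b": 20.0, "c": 30.0, "d": 40.0},
    "big_noisy": {"a": 12.0, "b": 17.0, "c": 33.0, "d": 38.0},
    "tiny": {"a": 50.0},
}
truths, weights = discover_truths(claims)
print(truths)
print(weights)
```

In this sketch the single-claim source ends up with a far smaller weight than the two larger sources, because the chi-squared quantile with one degree of freedom is close to zero. A production version would also need value normalization across items and a convergence check, both of which are omitted here.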
Pages: 425-436
Page count: 12