Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation

Cited by: 344
Authors
Li, Qi [1]
Li, Yaliang [1]
Gao, Jing [1]
Zhao, Bo [2]
Fan, Wei [3]
Han, Jiawei [4]
Affiliations
[1] SUNY Buffalo, Buffalo, NY 14260 USA
[2] Microsoft Res, Mountain View, CA USA
[3] Huawei Noah's Ark Lab, Hong Kong, People's Republic of China
[4] Univ Illinois, Urbana, IL 61801 USA
Source
SIGMOD'14: PROCEEDINGS OF THE 2014 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2014
Funding
National Science Foundation (USA)
DOI
10.1145/2588555.2610509
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In many applications, one can obtain descriptions about the same objects or events from a variety of sources. As a result, this will inevitably lead to data or information conflicts. One important problem is to identify the true information (i.e., the truths) among conflicting sources of data. It is intuitive to trust reliable sources more when deriving the truths, but it is usually unknown which one is more reliable a priori. Moreover, each source possesses a variety of properties with different data types. An accurate estimation of source reliability has to be made by modeling multiple properties in a unified model. Existing conflict resolution work either does not conduct source reliability estimation, or models multiple properties separately. In this paper, we propose to resolve conflicts among multiple sources of heterogeneous data types. We model the problem using an optimization framework where truths and source reliability are defined as two sets of unknown variables. The objective is to minimize the overall weighted deviation between the truths and the multi-source observations where each source is weighted by its reliability. Different loss functions can be incorporated into this framework to recognize the characteristics of various data types, and efficient computation approaches are developed. Experiments on real-world weather, stock and flight data as well as simulated multi-source data demonstrate the necessity of jointly modeling different data types in the proposed framework.
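The optimization described in the abstract can be illustrated with a small sketch. The abstract gives no update rules, so everything concrete below is an assumption made for illustration: the function name `resolve_conflicts`, the input layout, a 0-1 loss for categorical properties, an unnormalized squared loss for continuous ones, and a negative-log-of-loss-share weight update. It alternates between two steps: with source weights fixed, pick the truths that minimize the weighted deviation; with truths fixed, re-estimate each source's weight from its total loss.

```python
import math
from collections import defaultdict


def resolve_conflicts(observations, kinds, n_iter=20, eps=1e-9):
    """Alternating-minimization sketch (hypothetical helper, not the paper's code).

    observations: {source: {(object, property): value}}
    kinds: {property: "categorical" or "continuous"}
    Returns (truths, weights).
    """
    sources = list(observations)
    entries = sorted({e for obs in observations.values() for e in obs})
    weights = {s: 1.0 for s in sources}  # start by trusting every source equally
    truths = {}

    for _ in range(n_iter):
        # Step 1: weights fixed, update truths.
        for entry in entries:
            claims = [(observations[s][entry], weights[s])
                      for s in sources if entry in observations[s]]
            if kinds[entry[1]] == "categorical":
                # 0-1 loss: the minimizer is the weighted majority vote.
                votes = defaultdict(float)
                for value, w in claims:
                    votes[value] += w
                truths[entry] = max(votes, key=votes.get)
            else:
                # Squared loss: the minimizer is the weighted mean.
                total_w = sum(w for _, w in claims)
                if total_w <= 0:  # fall back to a plain mean
                    truths[entry] = sum(v for v, _ in claims) / len(claims)
                else:
                    truths[entry] = sum(v * w for v, w in claims) / total_w

        # Step 2: truths fixed, update weights from each source's total loss.
        losses = {}
        for s in sources:
            loss = 0.0
            for entry, value in observations[s].items():
                if kinds[entry[1]] == "categorical":
                    loss += 0.0 if truths[entry] == value else 1.0
                else:
                    loss += (truths[entry] - value) ** 2
            losses[s] = loss + eps  # eps keeps the logarithm finite
        total_loss = sum(losses.values())
        # Assumed rule: a smaller share of the total loss gives a larger weight.
        weights = {s: -math.log(losses[s] / total_loss) for s in sources}

    return truths, weights
```

Because the categorical and continuous losses feed a single weight per source, a source that is wrong on one data type is also trusted less on the other, which is the joint modeling the abstract argues for. This sketch does not rescale the two losses to a common range, so in practice the continuous loss would likely need normalizing before the two are summed.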
Pages: 1187-1198
Page count: 12