Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery

Cited by: 87
Authors
Li, Yaliang [1]
Li, Qi [1]
Gao, Jing [1]
Su, Lu [1]
Zhao, Bo [2]
Fan, Wei [3]
Han, Jiawei [4]
Affiliations
[1] SUNY Buffalo, 338 Davis Hall, Buffalo, NY 14260 USA
[2] LinkedIn, 2029 Stierlin Ct, Mountain View, CA 94043 USA
[3] Baidu Res Big Data Lab, 1195 Bordeaux Dr, Sunnyvale, CA 94089 USA
[4] Univ Illinois, 201 N Goodwin Ave, Urbana, IL 61801 USA
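Funding
U.S. National Science Foundation
Keywords
Data fusion; truth discovery; heterogeneous data
DOI
10.1109/TKDE.2016.2559481
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract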
In many applications, one can obtain descriptions about the same objects or events from a variety of sources. As a result, this will inevitably lead to data or information conflicts. One important problem is to identify the true information (i.e., the truths) among conflicting sources of data. It is intuitive to trust reliable sources more when deriving the truths, but it is usually unknown which one is more reliable a priori. Moreover, each source possesses a variety of properties with different data types. An accurate estimation of source reliability has to be made by modeling multiple properties in a unified model. Existing conflict resolution work either does not conduct source reliability estimation, or models multiple properties separately. In this paper, we propose to resolve conflicts among multiple sources of heterogeneous data types. We model the problem using an optimization framework where truths and source reliability are defined as two sets of unknown variables. The objective is to minimize the overall weighted deviation between the truths and the multi-source observations where each source is weighted by its reliability. Different loss functions can be incorporated into this framework to recognize the characteristics of various data types, and efficient computation approaches are developed. The proposed framework is further adapted to deal with streaming data in an incremental fashion and large-scale data in MapReduce model. Experiments on real-world weather, stock, and flight data as well as simulated multi-source data demonstrate the advantage of jointly modeling different data types in the proposed framework.
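The alternating optimization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes squared loss for continuous properties, 0-1 loss for categorical properties, and a negative-log weight update; the abstract specifies none of these. The function name `resolve_conflicts`, its parameters, and the sample data are hypothetical, and the two loss types are summed without the scale normalization a real system would need.

```python
import math
from collections import defaultdict


def resolve_conflicts(continuous, categorical, n_iter=20, eps=1e-9):
    """Jointly estimate truths and source weights (illustrative sketch).

    continuous / categorical: dict mapping source -> {object: observed value}.
    """
    sources = sorted(set(continuous) | set(categorical))
    weights = {s: 1.0 for s in sources}  # start by trusting all sources equally
    truths_num, truths_cat = {}, {}
    for _ in range(n_iter):
        # Step 1: fix the weights, update the truths.
        # Squared loss is minimized by the weighted mean.
        num_sum, num_w = defaultdict(float), defaultdict(float)
        for s, obs in continuous.items():
            for obj, value in obs.items():
                num_sum[obj] += weights[s] * value
                num_w[obj] += weights[s]
        truths_num = {obj: num_sum[obj] / num_w[obj] for obj in num_sum}
        # 0-1 loss is minimized by the weighted majority vote.
        votes = defaultdict(lambda: defaultdict(float))
        for s, obs in categorical.items():
            for obj, value in obs.items():
                votes[obj][value] += weights[s]
        truths_cat = {obj: max(c, key=c.get) for obj, c in votes.items()}

        # Step 2: fix the truths, update the weights from each source's loss.
        loss = {s: eps for s in sources}  # eps avoids log(0)
        for s, obs in continuous.items():
            for obj, value in obs.items():
                loss[s] += (value - truths_num[obj]) ** 2
        for s, obs in categorical.items():
            for obj, value in obs.items():
                loss[s] += 0.0 if value == truths_cat[obj] else 1.0
        total = sum(loss.values())
        # Assumed update rule: a larger share of the total loss gives a smaller weight.
        weights = {s: max(-math.log(loss[s] / total), eps) for s in sources}
    return truths_num, truths_cat, weights


# Hypothetical data: sources A and B roughly agree, source C is an outlier.
continuous = {"A": {"temp": 70.0}, "B": {"temp": 71.0}, "C": {"temp": 90.0}}
categorical = {"A": {"cond": "sunny"}, "B": {"cond": "sunny"}, "C": {"cond": "rain"}}
truths_num, truths_cat, weights = resolve_conflicts(continuous, categorical)
```

Because one weight per source is shared across both data types, the outlier's errors on the continuous property also reduce its influence on the categorical vote, which is the joint-modeling idea the abstract emphasizes.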
Pages: 1986-1999
Page count: 14