On the Meaningfulness of "Big Data Quality" (Invited Paper)

Cited by: 49
Authors
Firmani, Donatella [1]
Mecella, Massimo [2]
Scannapieco, Monica [3]
Batini, Carlo [4]
Affiliations
[1] Univ Roma Tor Vergata, Rome, Italy
[2] Sapienza Univ Roma, Rome, Italy
[3] Ist Nazl Stat ISTAT, Rome, Italy
[4] Univ Milano Bicocca, Milan, Italy
Keywords
Data quality; Big data; Quality dimensions; Information quality
DOI
10.1007/s41019-015-0004-7
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we discuss the application of the concept of data quality to big data, highlighting how complex it is to define it in a general way. Data quality is already a multidimensional concept that is difficult to characterize with precise definitions, even for well-structured data. Big data add two further dimensions of complexity: (i) they are "very" source specific, and for this we adopt the interesting UNECE classification, and (ii) they are highly unstructured and schema-less, often with gold standards that are either missing or very difficult to access. After providing a tutorial on data quality in traditional contexts, we analyze big data by providing insights into the UNECE classification; then, for each type of data source, we choose a specific instance of that type (notably deep Web data, sensor-generated data, and Twitter posts/short texts) and discuss how quality dimensions can be defined in these cases. The overall aim of the paper is therefore to identify further research directions in the area of big data quality, while also providing an up-to-date state of the art on data quality.
Pages: 6-20
Page count: 15