Determining the Real Data Completeness of a Relational Dataset

Cited by: 0
Authors
Yong-Nan Liu
Jian-Zhong Li
Zhao-Nian Zou
Affiliation
[1] Harbin Institute of Technology, School of Computer Science and Engineering
Source
Journal of Computer Science and Technology | 2016 / Vol. 31
Keywords
data quality; data completeness; functional dependency; data completeness model; optimal algorithm
DOI
Not available
Abstract
Low data quality is a serious problem in the era of big data: it can severely reduce the usability of data, mislead or bias querying, analysis, and mining, and lead to huge losses. Incomplete data is common in low-quality data, and it is necessary to determine the data completeness of a dataset to guide follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work treats all missing values as unknown. In this paper, we study how to determine the real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to determine some missing attribute values from other tuples and to identify the attribute cells that are really missing. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound on the time complexity of this problem. We propose two optimal algorithms that determine the data completeness of a dataset in different cases. We empirically show the effectiveness and scalability of our algorithms on both real-world and synthetic data.
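The core idea in the abstract, that a functional dependency can recover some missing cells from other tuples so that only the remaining cells count as really missing, can be illustrated with a small sketch. This is not the paper's algorithm or its completeness model. The function names, the sample relation, the dependency zip → city, and the choice to measure completeness as the fraction of non-missing cells are all assumptions made for illustration, and the data is assumed to be consistent with the given dependencies.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[str]]
# A functional dependency as (left-hand side attributes, right-hand side attribute).
FD = Tuple[Sequence[str], str]


def fill_by_fds(rows: List[Row], fds: List[FD]) -> List[Row]:
    """Return a copy of rows with missing values filled where an FD determines them.

    Hypothetical helper: assumes the data does not violate the FDs.
    """
    filled = [dict(r) for r in rows]
    changed = True
    while changed:  # repeat until no FD can determine another cell
        changed = False
        for lhs, rhs in fds:
            # Map each fully known LHS value combination to a known RHS value.
            known = {}
            for r in filled:
                key = tuple(r[a] for a in lhs)
                if None not in key and r[rhs] is not None:
                    known.setdefault(key, r[rhs])
            # Fill a missing RHS cell when another tuple agrees on the LHS.
            for r in filled:
                key = tuple(r[a] for a in lhs)
                if r[rhs] is None and None not in key and key in known:
                    r[rhs] = known[key]
                    changed = True
    return filled


def completeness(rows: List[Row]) -> float:
    """Fraction of cells that hold a value (None marks a missing cell).

    Assumed measure for this sketch, not the model defined in the paper.
    """
    total = sum(len(r) for r in rows)
    present = sum(v is not None for r in rows for v in r.values())
    return present / total if total else 1.0


# Illustrative relation and dependency (zip -> city); both are made up.
rows = [
    {"zip": "150001", "city": "Harbin", "phone": "111"},
    {"zip": "150001", "city": None, "phone": None},
    {"zip": None, "city": None, "phone": "333"},
]
fds = [(("zip",), "city")]

naive = completeness(rows)                  # every None counted as missing
real = completeness(fill_by_fds(rows, fds))  # only undeterminable cells counted
print(f"naive={naive:.3f} real={real:.3f}")
```

In this example the second tuple's city can be inferred from the first tuple through zip → city, so one of the four empty cells is not really missing; the other three cannot be determined by the dependency.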
Pages: 720-740
Page count: 20