Improving data quality by source analysis

Authors
Müller, Heiko [1]
Freytag, Johann-Christoph [2]
Leser, Ulf [2]
Affiliations
[1] Tasmanian ICT Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Hobart TAS 7001
[2] Institut für Informatik, Humboldt-Universität zu Berlin, 10099 Berlin
Keywords
Conflict resolution; Data cleaning; Quality assessment; Semantic distance measure
DOI
10.1145/2107536.2107538
Abstract
In many domains, data cleaning is hampered by our limited ability to specify a comprehensive set of integrity constraints to assist in the identification of erroneous data. An alternative approach to improving data quality is to exploit different data sources that contain information about the same set of objects. Such overlapping sources highlight hot spots of poor data quality through conflicting data values and immediately provide alternative values for conflict resolution. To derive a dataset of high quality, we can merge the overlapping sources based on a quality assessment of the conflicting values. The quality of the resulting dataset, however, is highly dependent on our ability to assess the quality of conflicting values effectively. The main objective of this article is to introduce methods that aid the developer of an integrated system over overlapping but contradicting sources in the task of improving the quality of data. Value conflicts between contradicting sources are often systematic, caused by some characteristic of the different sources. Our goal is to identify such systematic differences and outline data patterns that occur in conjunction with them. Evaluated by an expert user, the regularities discovered provide insights into possible conflict reasons and help to assess the quality of inconsistent values. The contributions of this article are two concepts of systematic conflicts: contradiction patterns and minimal update sequences. Contradiction patterns resemble a special form of association rules that summarize characteristic data properties for conflict occurrence. We adapt existing association rule mining algorithms for mining contradiction patterns. Contradiction patterns, however, view each class of conflicts in isolation, sometimes leading to largely overlapping patterns. Sequences of set-oriented update operations that transform one data source into the other are compact descriptions for all regular differences among the sources. We consider minimal update sequences as the most likely explanation for observed differences between overlapping data sources. Furthermore, the order of operations within the sequences points out potential dependencies between systematic differences. Finding minimal update sequences, however, is beyond reach in practice. We show that the problem is already NP-complete for a restricted set of operations. In light of this intractability result, we present heuristics that lead to convincing results for all examples we considered. © 2012 ACM.
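The idea of a contradiction pattern can be illustrated with a small sketch. The code below is not the authors' algorithm; it is a simplified, single-condition variant written for illustration only. The function name `conflict_patterns`, the thresholds `min_support` and `min_confidence`, and the toy records are all assumptions that do not appear in the article. For two sources keyed by the same object identifiers, it counts how often each attribute-value condition co-occurs with a conflict in a chosen attribute, in the spirit of association rule support and confidence.

```python
from collections import Counter


def conflict_patterns(source_a, source_b, attribute,
                      min_support=2, min_confidence=0.8):
    """Find single-condition patterns that co-occur with conflicts.

    Hypothetical helper, not taken from the article. Each source maps an
    object key to a dict of attribute values. A conflict is an object
    present in both sources whose `attribute` values differ.
    """
    matched = Counter()      # objects in both sources satisfying a condition
    conflicting = Counter()  # of those, objects with a conflict in `attribute`

    for key, row_a in source_a.items():
        row_b = source_b.get(key)
        if row_b is None:
            continue  # only overlapping objects can conflict
        is_conflict = row_a[attribute] != row_b[attribute]
        for cond_attr, cond_value in row_a.items():
            if cond_attr == attribute:
                continue  # do not explain a conflict by the conflicting value
            condition = (cond_attr, cond_value)
            matched[condition] += 1
            if is_conflict:
                conflicting[condition] += 1

    patterns = []
    for condition, support in conflicting.items():
        confidence = support / matched[condition]
        if support >= min_support and confidence >= min_confidence:
            patterns.append((condition, support, confidence))
    # Most confident, then best supported, patterns first.
    return sorted(patterns, key=lambda p: (-p[2], -p[1], p[0]))


# Toy data (invented): every record from lab "X" disagrees on "dose".
source_a = {
    1: {"lab": "X", "unit": "mg", "dose": "10"},
    2: {"lab": "X", "unit": "mg", "dose": "20"},
    3: {"lab": "Y", "unit": "mg", "dose": "10"},
    4: {"lab": "Y", "unit": "mg", "dose": "30"},
    5: {"lab": "X", "unit": "g", "dose": "5"},
}
source_b = {
    1: {"lab": "X", "unit": "mg", "dose": "0.01"},
    2: {"lab": "X", "unit": "mg", "dose": "0.02"},
    3: {"lab": "Y", "unit": "mg", "dose": "10"},
    4: {"lab": "Y", "unit": "mg", "dose": "30"},
    5: {"lab": "X", "unit": "g", "dose": "0.005"},
}

patterns = conflict_patterns(source_a, source_b, "dose")
print(patterns)
```

With these toy records, the condition lab = "X" holds for every conflicting object and for no agreeing one, so it is the kind of regularity an expert could inspect as a possible conflict reason. Conditions that are too rare or too weakly associated with conflicts are filtered out by the two assumed thresholds.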