A model-based evaluation of data quality activities in KDD

Cited by: 26
Authors
Mezzanzanica, Mario [1 ,2 ]
Boselli, Roberto [1 ,2 ]
Cesarini, Mirko [1 ,2 ]
Mercorio, Fabio [2 ]
Affiliations
[1] Univ Milano Bicocca, Dept Stat & Quantitat Methods, I-20126 Milan, Italy
[2] Univ Milano Bicocca, CRISP Res Ctr, I-20126 Milan, Italy
Keywords
Data quality; Data cleansing; Model checking; Real-life application; CHECKING; KNOWLEDGE;
DOI
10.1016/j.ipm.2014.07.007
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We live in the Information Age, where most personal, business, and administrative data are collected and managed electronically. However, poor data quality may reduce the effectiveness of knowledge discovery processes, making the development of data improvement steps a significant concern. In this paper we propose the Multidimensional Robust Data Quality Analysis, a domain-independent technique aimed at improving data quality by evaluating the effectiveness of a black-box cleansing function. The proposed approach has been realized through model checking techniques and then applied to a weakly structured dataset describing the working careers of millions of people. Our experimental outcomes show the effectiveness of our model-based approach to data quality: they provide a fine-grained analysis of both the source dataset and the cleansing procedures, enabling domain experts to identify the most relevant quality issues as well as the action points for improving the cleansing activities. Finally, an anonymized version of the dataset and the analysis results have been made publicly available to the community. (C) 2014 Elsevier Ltd. All rights reserved.
Pages: 144-166
Page count: 23