Determining the Real Data Completeness of a Relational Dataset

Cited by: 6
Authors
Liu, Yong-Nan [1]
Li, Jian-Zhong [1]
Zou, Zhao-Nian [1]
Affiliation
[1] Harbin Inst Technol, Sch Comp Sci & Engn, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
data quality; data completeness; functional dependency; data completeness model; optimal algorithm
DOI
10.1007/s11390-016-1659-x
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Low data quality is a serious problem in the era of big data: it can severely reduce the usability of data, mislead or bias querying, analysis and mining, and lead to huge losses. Incomplete data is a common form of low-quality data, and determining the data completeness of a dataset is necessary to guide follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work treats all missing values as unknown. In this paper, we study how to determine the real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to determine some missing attribute values from other tuples and to capture the attribute cells that are really missing. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound on the time complexity of this problem. We propose two optimal algorithms that determine the data completeness of a dataset in different cases. We empirically show the effectiveness and scalability of our algorithms on both real-world and synthetic data.
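The idea of recovering missing cells through functional dependencies can be illustrated with a minimal Python sketch. This is not the paper's model or either of its algorithms, whose details the abstract does not give. The function names, the representation of a dependency as a `(lhs_attributes, rhs_attribute)` pair, the use of `None` for a missing cell, and the toy rows are all assumptions made for illustration, and the sketch assumes the known values do not violate the given dependencies.

```python
from typing import Optional

# Hypothetical representation: a row maps attribute names to values,
# with None standing for a missing cell.
Row = dict[str, Optional[str]]
# Hypothetical representation of a functional dependency X -> A.
FD = tuple[tuple[str, ...], str]


def completeness(rows: list[Row]) -> float:
    """Fraction of cells that are not None."""
    total = sum(len(r) for r in rows)
    missing = sum(v is None for r in rows for v in r.values())
    return (total - missing) / total if total else 1.0


def infer_missing(rows: list[Row], fds: list[FD]) -> list[Row]:
    """Fill missing cells that a dependency determines from other rows.

    Repeats until no cell changes, since one filled cell may complete
    the left-hand side of another dependency.
    """
    rows = [dict(r) for r in rows]  # copy, leave the input untouched
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Collect a known right-hand value for each fully known left-hand key.
            known: dict[tuple, str] = {}
            for r in rows:
                key = tuple(r[a] for a in lhs)
                if None not in key and r[rhs] is not None:
                    known.setdefault(key, r[rhs])
            # Fill rows whose left-hand key matches a known value.
            for r in rows:
                key = tuple(r[a] for a in lhs)
                if r[rhs] is None and None not in key and key in known:
                    r[rhs] = known[key]
                    changed = True
    return rows


# Toy data (invented for illustration): zip -> city.
rows: list[Row] = [
    {"name": "A", "zip": "150001", "city": "Harbin"},
    {"name": "B", "zip": "150001", "city": None},   # recoverable via zip -> city
    {"name": None, "zip": "100000", "city": None},  # really missing
]
fds: list[FD] = [(("zip",), "city")]

naive = completeness(rows)                     # counts every None as missing
real = completeness(infer_missing(rows, fds))  # counts only unrecoverable cells
```

In this toy case the naive measure treats three of nine cells as missing, while the dependency-aware measure treats only two as missing, because the second row's city is determined by the first row.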
Pages: 720-740
Page count: 21