Determining the Real Data Completeness of a Relational Dataset

Cited by: 6
Authors
Liu, Yong-Nan [1]
Li, Jian-Zhong [1]
Zou, Zhao-Nian [1]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Engn, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
data quality; data completeness; functional dependency; data completeness model; optimal algorithm;
DOI
10.1007/s11390-016-1659-x
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Low data quality is a serious problem in the era of big data: it can severely reduce the usability of data, mislead or bias querying, analysis, and mining, and lead to huge losses. Incomplete data is common in low-quality data, so it is necessary to determine the data completeness of a dataset to guide follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work treats all missing values as unknown. In this paper, we study how to determine the real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to infer some missing attribute values from other tuples and thereby identify the attribute cells that are truly missing. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound on the time complexity of this problem. We propose two optimal algorithms, each for a different case, to determine the data completeness of a dataset. We empirically show the effectiveness and scalability of our algorithms on both real-world and synthetic data.
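The abstract's core idea, that a functional dependency lets some missing cells be recovered from other tuples so that only the remaining cells count as truly missing, can be illustrated with a small sketch. This is not the paper's model or either of its algorithms, which the abstract does not spell out. It is a naive fixed-point pass, and the function name `real_completeness`, the row and dependency encodings, and the definition of completeness as the fraction of non-missing cells are all assumptions made for illustration.

```python
def real_completeness(rows, fds):
    """Naive illustration (hypothetical helper, not the paper's algorithm).

    rows: list of dicts mapping attribute -> value, with None for a missing cell.
    fds:  list of (lhs, rhs) pairs, lhs a tuple of attributes, rhs one attribute,
          read as the functional dependency lhs -> rhs.
    Returns (filled_rows, completeness), where completeness is ASSUMED to be
    the fraction of cells that are still non-missing after inference.
    """
    rows = [dict(r) for r in rows]  # work on a copy; leave the input untouched
    changed = True
    while changed:  # repeat until no dependency can fill another cell
        changed = False
        for lhs, rhs in fds:
            # Collect a known rhs value for every fully known lhs combination.
            known = {}
            for r in rows:
                if r[rhs] is not None and all(r[a] is not None for a in lhs):
                    known.setdefault(tuple(r[a] for a in lhs), r[rhs])
            # Fill a missing rhs cell when another tuple agrees on lhs.
            for r in rows:
                if r[rhs] is None and all(r[a] is not None for a in lhs):
                    key = tuple(r[a] for a in lhs)
                    if key in known:
                        r[rhs] = known[key]
                        changed = True
    total = sum(len(r) for r in rows)
    missing = sum(v is None for r in rows for v in r.values())
    return rows, (1 - missing / total if total else 1.0)


# Toy data (invented for illustration): zip -> city is the only dependency.
rows = [
    {"zip": "150001", "city": "Harbin", "phone": None},
    {"zip": "150001", "city": None, "phone": "555"},
    {"zip": None, "city": None, "phone": None},
]
fds = [(("zip",), "city")]
filled, ratio = real_completeness(rows, fds)
```

In this toy input five of the nine cells are missing, but the second tuple's city can be inferred from the first, so only four cells remain truly missing. Treating every missing value as unknown would therefore understate the completeness of the dataset.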
Pages: 720-740
Page count: 21