Discovery of Genuine Functional Dependencies from Relational Data with Missing Values

Cited by: 30
Authors
Berti-Equille, Laure [1]
Harmouch, Nazar [2]
Naumann, Felix [2]
Novelli, Noel [1]
Saravanan [3]
Affiliations
[1] Aix Marseille Univ, CNRS, LIS, Marseille, France
[2] Univ Potsdam, Hasso Plattner Inst, Potsdam, Germany
[3] HBKU, QCRI, Doha, Qatar
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2018 / Volume 11 / Issue 08
Keywords
IMPUTATION;
DOI
10.14778/3204028.3204032
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. With existing FD discovery algorithms, some genuine FDs cannot be detected precisely because of missing values, while some non-genuine FDs may be discovered even though they arise only from missing values under a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This score can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
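The dependence of FD validity on NULL semantics, as described in the abstract, can be illustrated with a minimal sketch. This is not the authors' algorithm and it does not compute a genuineness score; the function `fd_holds`, its parameters, and the toy `zip`/`city` rows are hypothetical names chosen for illustration. It only checks whether a candidate FD holds when NULLs are treated as equal to each other versus as pairwise distinct.

```python
from itertools import count


def fd_holds(rows, lhs, rhs, null_equals_null):
    """Return True if the FD lhs -> rhs holds on rows (a list of dicts).

    Hypothetical helper: None stands for a missing value (NULL).
    null_equals_null=True  treats all NULLs as one shared value.
    null_equals_null=False treats every NULL as a distinct value.
    """
    fresh = count()  # supplies a unique token per NULL when NULLs are distinct

    def key(value):
        if value is None:
            return ("NULL",) if null_equals_null else ("NULL", next(fresh))
        return ("VAL", value)

    seen = {}  # maps each left-hand-side combination to its right-hand side
    for row in rows:
        left = tuple(key(row[a]) for a in lhs)
        right = tuple(key(row[a]) for a in rhs)
        # A violation is two rows that agree on lhs but differ on rhs.
        if seen.setdefault(left, right) != right:
            return False
    return True


# Toy data (assumed for illustration): two rows are missing the zip code.
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": None, "city": "Potsdam"},
    {"zip": None, "city": "Doha"},
]

same = fd_holds(rows, ["zip"], ["city"], null_equals_null=True)
distinct = fd_holds(rows, ["zip"], ["city"], null_equals_null=False)
print(same, distinct)
```

On this toy table, `zip -> city` is violated when the two NULLs are considered equal and holds when they are considered distinct, so whether the FD is reported depends only on how missing values are interpreted.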
Pages: 880 - 892
Page count: 13
Related papers
50 in total
  • [41] Predicting Missing Values in Spatio-Temporal Remote Sensing Data
    Gerber, Florian
    de Jong, Rogier
    Schaepman, Michael E.
    Schaepman-Strub, Gabriela
    Furrer, Reinhard
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2018, 56 (05): : 2841 - 2853
  • [42] Novel Methods for Imputing Missing Values in Water Level Monitoring Data
    Khampuengson, Thakolpat
    Wang, Wenjia
    WATER RESOURCES MANAGEMENT, 2023, 37 (02) : 851 - 878
  • [43] Imputation Strategies for Clustering Mixed-Type Data with Missing Values
    Aschenbruck, Rabea
    Szepannek, Gero
    Wilhelm, Adalbert F. X.
    JOURNAL OF CLASSIFICATION, 2023, 40 (01) : 2 - 24
  • [44] Low-rank model with covariates for count data with missing values
    Robin, Genevieve
    Josse, Julie
    Moulines, Eric
    Sardy, Sylvain
    JOURNAL OF MULTIVARIATE ANALYSIS, 2019, 173 : 416 - 434
  • [45] NON-NEGATIVE MATRIX FACTORIZATION OF CLUSTERED DATA WITH MISSING VALUES
    Chen, Rebecca
    Varshney, Lav R.
    2019 IEEE DATA SCIENCE WORKSHOP (DSW), 2019, : 180 - 184
  • [46] Using phylogenetic information to impute missing functional trait values in ecological databases
    Debastiani, Vanderlei J.
    Bastazini, Vinicius A. G.
    Pillar, Valerio D.
    ECOLOGICAL INFORMATICS, 2021, 63
  • [47] MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models
    Gao, Erdun
    Ng, Ignavier
    Gong, Mingming
    Shen, Li
    Huang, Wei
    Liu, Tongliang
    Zhang, Kun
    Bondell, Howard
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [48] XeroGraph: enhancing data integrity in the presence of missing values with statistical and predictive analysis
    Alasal, Laila Mousafi
    Hammarlund, Emma U.
    Pienta, Kenneth J.
    Ronnstrand, Lars
    Kazi, Julhash U.
    BIOINFORMATICS ADVANCES, 2025, 5 (01):
  • [49] The Glivenko-Cantelli theorem based on data with randomly imputed missing values
    Mojirsheibani, M
    STATISTICS & PROBABILITY LETTERS, 2001, 55 (04) : 385 - 396
  • [50] Treating missing values in INAR(1) models: An application to syndromic surveillance data
    Andersson, Jonas
    Karlis, Dimitris
    JOURNAL OF TIME SERIES ANALYSIS, 2010, 31 (01) : 12 - 19