Discovering Relaxed Functional Dependencies Based on Multi-Attribute Dominance

Cited by: 21
Authors
Caruccio, Loredana [1]
Deufemia, Vincenzo [1]
Naumann, Felix [2]
Polese, Giuseppe [1]
Affiliations
[1] Univ Salerno, Dept Comp Sci, I-84084 Fisciano, SA, Italy
[2] Univ Potsdam, Hasso Plattner Inst, D-14482 Potsdam, Germany
Keywords
Complexity theory; Approximation algorithms; Big Data; Distributed databases; Semantics; Lakes; Functional dependencies; data profiling; data cleansing; EFFICIENT DISCOVERY
DOI
10.1109/TKDE.2020.2967722
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the advent of big data and data lakes, data are often integrated from multiple sources. Such integrated data are often of poor quality, due to inconsistencies, errors, and so forth. One way to check the quality of data is to infer functional dependencies (FDs). However, many modern applications need to extract properties and relationships that FDs cannot capture, either because exceptions must be admitted or because data values must be compared by similarity rather than equality. Relaxed FDs (RFDs) have been introduced to meet these needs, but their discovery from data adds further complexity to an already complex problem, in part because similarity and validity thresholds must be specified. We propose Domino, a new discovery algorithm for RFDs that exploits the concept of dominance in order to derive similarity thresholds of attribute values while inferring RFDs. An experimental evaluation on real datasets demonstrates the discovery performance and the effectiveness of the proposed algorithm.
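To make the two kinds of relaxation mentioned in the abstract concrete, the following is a minimal sketch of checking a single relaxed dependency on a small table. It is not the Domino algorithm and does not use dominance; the function name `holds_rfd`, the absolute-difference similarity, and the pair-based violation ratio are all assumptions chosen for illustration. Here the thresholds are given by hand, which is exactly the step the abstract says Domino derives during discovery.

```python
from itertools import combinations


def holds_rfd(rows, lhs, rhs, thresholds, max_violation_ratio=0.0):
    """Check a relaxed dependency lhs -> rhs on a list of dict rows.

    Illustrative assumptions (not taken from the paper):
    - values are numeric, and two values are "similar" on attribute a
      when their absolute difference is at most thresholds[a];
    - a pair of rows violates the dependency when it is similar on every
      lhs attribute but not similar on the rhs attribute;
    - the dependency holds when the share of violating pairs does not
      exceed max_violation_ratio.
    """
    def similar(r1, r2, attr):
        return abs(r1[attr] - r2[attr]) <= thresholds[attr]

    pairs = list(combinations(rows, 2))
    violations = sum(
        1
        for r1, r2 in pairs
        if all(similar(r1, r2, a) for a in lhs) and not similar(r1, r2, rhs)
    )
    ratio = violations / len(pairs) if pairs else 0.0
    return ratio <= max_violation_ratio


# Hypothetical sample data for illustration only.
rows = [
    {"height": 180, "weight": 80},
    {"height": 181, "weight": 82},
    {"height": 160, "weight": 55},
    {"height": 180, "weight": 95},
]
thresholds = {"height": 1, "weight": 3}

strict = holds_rfd(rows, ["height"], "weight", thresholds)
tolerant = holds_rfd(rows, ["height"], "weight", thresholds,
                     max_violation_ratio=0.5)
```

With these sample rows, two of the six row pairs are similar in height but not in weight, so the dependency fails when no exceptions are allowed and passes when up to half of the pairs may violate it. Setting every threshold to 0 and `max_violation_ratio` to 0 reduces the check to an ordinary functional dependency on numeric values.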
Pages: 3212-3228
Page count: 17
Related papers
31 in total
  • [1] Abedjan Ziawasch, 2014, P 23 ACM INT C C INF, P949, DOI 10.1145/2661829.2661884
  • [2] Caruccio L, 2018, IEEE INT CONF BIG DA, P5078, DOI 10.1109/BigData.2018.8622011
  • [3] Relaxed Functional Dependencies-A Survey of Approaches
    Caruccio, Loredana
    Deufemia, Vincenzo
    Polese, Giuseppe
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2016, 28 (01) : 147 - 165
  • [4] Caruccio Loredana, 2017, P 7 INT C WEB INT MI, P5
  • [5] Fame for sale: Efficient detection of fake Twitter followers
    Cresci, Stefano
    Di Pietro, Roberto
    Petrocchi, Marinella
    Spognardi, Angelo
    Tesconi, Maurizio
    [J]. DECISION SUPPORT SYSTEMS, 2015, 80 : 56 - 71
  • [6] A Revival of Integrity Constraints for Data Cleaning
    Fan, Wenfei
    Geerts, Floris
    Jia, Xibei
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (02) : 1522 - 1523
  • [7] Dynamic constraints for record matching
    Fan, Wenfei
    Gao, Hong
    Jia, Xibei
    Li, Jianzhong
    Ma, Shuai
    [J]. VLDB JOURNAL, 2011, 20 (04) : 495 - 520
  • [8] Flach PA, 1999, AI COMMUN, V12, P139
  • [9] TANE: An efficient algorithm for discovering functional and approximate dependencies
    Huhtala, Y
    Kärkkäinen, J
    Porkka, P
    Toivonen, H
    [J]. COMPUTER JOURNAL, 1999, 42 (02) : 100 - 111
  • [10] APPROXIMATE INFERENCE OF FUNCTIONAL-DEPENDENCIES FROM RELATIONS
    KIVINEN, J
    MANNILA, H
    [J]. THEORETICAL COMPUTER SCIENCE, 1995, 149 (01) : 129 - 149