A Hybrid Data Cleaning Framework Using Markov Logic Networks (Extended Abstract)

Cited by: 1
Authors
Ge, Congcong [1]
Gao, Yunjun [1]
Miao, Xiaoye [2]
Yao, Bin [3]
Wang, Haobo [1]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, Ctr Data Sci, Hangzhou, Peoples R China
[3] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China
Source
2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2021) | 2021
DOI
10.1109/ICDE51399.2021.00258
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
With the growth of dirty data, data cleaning has become a crux of data analysis. In this paper, we propose a novel hybrid data cleaning framework, termed MLNClean, which is capable of learning instantiated rules to supplement insufficient integrity constraints. MLNClean consists of two steps: pre-processing and two-stage data cleaning. In the pre-processing step, MLNClean first infers a set of probable instantiated rules according to a Markov logic network (MLN), and then builds a two-layer MLN index to generate multiple data versions and facilitate the cleaning process. In the two-stage data cleaning step, MLNClean first uses the concept of a reliability score to clean errors within each data version separately, and then eliminates conflicting values among different data versions using a novel concept of a fusion score. Extensive experiments on both real and synthetic scenarios demonstrate the effectiveness of MLNClean.
Pages: 2344-2345
Number of pages: 2