A Hybrid Data Cleaning Framework Using Markov Logic Networks (Extended Abstract)
Cited by: 1
Authors:
Ge, Congcong [1]
Gao, Yunjun [1]
Miao, Xiaoye [2]
Yao, Bin [3]
Wang, Haobo [1]
Affiliations:
[1] Zhejiang Univ, Coll Comp Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, Ctr Data Sci, Hangzhou, Peoples R China
[3] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China
Source:
2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2021) | 2021
DOI:
10.1109/ICDE51399.2021.00258
Chinese Library Classification (CLC) number:
TP [Automation Technology, Computer Technology]
Discipline classification code:
0812
Abstract:
With the growth of dirty data, data cleaning has become a crux of data analysis. In this paper, we propose a novel hybrid data cleaning framework, termed MLNClean, which is capable of learning instantiated rules to supplement insufficient integrity constraints. MLNClean consists of two steps, i.e., pre-processing and two-stage data cleaning. In the pre-processing step, MLNClean first infers a set of probable instantiated rules according to a Markov logic network (MLN) and then builds a two-layer MLN index to generate multiple data versions and facilitate the cleaning process. In the two-stage data cleaning step, MLNClean first introduces the concept of a reliability score to clean errors within each data version separately, and then eliminates conflicting values among different data versions using a novel fusion score. Extensive experiments on both real and synthetic scenarios demonstrate the effectiveness of MLNClean.