Graph Deep Active Learning Framework for Data Deduplication

Cited by: 3
Authors
Cao, Huan [1 ,2 ]
Du, Shengdong [1 ,2 ]
Hu, Jie [1 ,2 ]
Yang, Yan [1 ,2 ]
Horng, Shi-Jinn [3 ]
Li, Tianrui [1 ,2 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu 611756, Peoples R China
[2] Minist Educ, Engn Res Ctr Sustainable Urban Intelligent Transpo, Chengdu 611756, Peoples R China
[3] Asia Univ, Coll Informat & Elect Engn, Chongsheng 41359, Peoples R China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Computational modeling; Bidirectional control; Big Data; Filtering algorithms; Feature extraction; Information filters; Data models; data deduplication; active learning; similarity calculation;
DOI
10.26599/BDMA.2023.9020040
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the advent of the big data era, a growing volume of duplicate data is expressed in different forms. Data deduplication, which reduces redundant storage and improves data quality, has never been more important. Especially in multi-source data deduplication, it is usually necessary to connect multiple data tables and identify different records that refer to the same entity. Active learning trains a model by selecting the data items with the maximum information divergence, which reduces the amount of data to be annotated and gives it unique advantages for big data annotation. However, most current active learning methods address only classical entity matching and are rarely applied to data deduplication tasks. To fill this research gap, we propose a novel graph deep active learning framework for data deduplication. The framework combines similarity algorithms with the bidirectional encoder representations from transformers (BERT) model to extract deep similarity features of multi-source data records. It is also the first to introduce a graph active learning strategy, which builds a clean graph to filter the data that need to be labeled, so that duplicate data can be deleted while the most information is retained. Experimental results on real-world datasets demonstrate that the proposed method outperforms state-of-the-art active learning models on data deduplication tasks.
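The selection step that the abstract attributes to active learning (choosing the most informative items to annotate) can be illustrated with a minimal sketch. This is not the paper's method: it assumes a plain character-level similarity from Python's `difflib` in place of the BERT-based features, treats that similarity as a rough duplicate probability, and uses binary entropy as the informativeness score. The function names `similarity`, `binary_entropy`, and `select_for_labeling` are hypothetical.

```python
import math
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    Assumption: a stand-in for the similarity features described in the abstract.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; largest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_labeling(pairs, k):
    """Return the k record pairs whose duplicate status is most uncertain.

    Assumption: the similarity score is used directly as the duplicate probability.
    """
    scored = [(binary_entropy(similarity(a, b)), (a, b)) for a, b in pairs]
    # Stable sort, highest entropy first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:k]]


if __name__ == "__main__":
    candidates = [("abc", "abc"), ("abcd", "abxy"), ("abc", "xyz")]
    print(select_for_labeling(candidates, 1))
```

Pairs that are clearly identical or clearly different score zero entropy and are skipped, so the annotation budget goes to the ambiguous pairs, which is the general idea behind reducing the data to be annotated.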
Pages: 753-764
Number of pages: 12