Low-resource entity resolution with domain generalization and active learning

Cited by: 0
Authors
Xu, Zhihong [1]
Wang, Ning [1]
Affiliations
[1] Beijing Jiaotong Univ, Sch Comp & Informat Technol, Beijing 100044, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Entity resolution; Domain generalization; Active learning; Data preprocessing;
DOI
10.1016/j.neucom.2024.128131
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Entity Resolution (ER), a fundamental task in data cleaning and integration, is critical in fields such as healthcare, e-commerce, and social networks. Traditional ER methods are constrained by the need for substantial labeled samples and by the difficulty of generalizing to unseen domains. To address this low-resource challenge, we propose a novel two-phase framework. First, we introduce the Domain Generalization Entity Resolution (DGER) framework, which combines domain adversarial learning and simulated target learning to improve generalization performance on unseen target domains. Second, to further adapt to the target dataset, we present a novel active learning approach called Domain-Aware Uncertainty Active Learning (DUAL), which fine-tunes the DGER model with minimal annotation cost. DUAL sends target domain samples that are highly uncertain and diverge strongly from the source for manual annotation, while assigning pseudo-labels to high-confidence samples. Experimental results on multiple real-world datasets demonstrate that our framework outperforms traditional ER methods in generalizing to unseen domains. Specifically, our DGER method outperforms the best-performing ER baseline in each task, achieving an average F1 score improvement of 9.02% across eight different test tasks. Moreover, within a limited annotation budget during the active learning phase, our DUAL fine-tuning strategy for the ER model outperforms uncertainty-based active learning techniques.
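The sample-selection idea described for DUAL can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring formula, so the product of prediction entropy and a divergence score, the function name `split_target_samples`, the `divergence` input (assumed to be a per-sample value in [0, 1], for example from a domain discriminator), and the 0.95 confidence threshold are all assumptions made for illustration.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a match probability p; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def split_target_samples(match_probs, divergence, budget, confidence=0.95):
    """Hypothetical helper: pick samples to annotate and samples to pseudo-label.

    match_probs: predicted probability that each candidate pair is a match.
    divergence:  assumed per-sample score in [0, 1]; higher means the sample
                 looks less like the source domain.
    budget:      number of samples sent for manual annotation.
    """
    # Assumed combination rule: uncertainty weighted by source divergence.
    scores = [binary_entropy(p) * d for p, d in zip(match_probs, divergence)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    to_annotate = ranked[:budget]

    # Confident samples that were not selected receive the predicted label.
    chosen = set(to_annotate)
    pseudo_labels = {}
    for i, p in enumerate(match_probs):
        if i not in chosen and max(p, 1 - p) >= confidence:
            pseudo_labels[i] = 1 if p >= 0.5 else 0
    return to_annotate, pseudo_labels


# Toy inputs, invented for illustration only.
probs = [0.5, 0.99, 0.45, 0.02, 0.7]
div = [0.9, 0.8, 0.2, 0.6, 0.5]
annotate, pseudo = split_target_samples(probs, div, budget=2)
print(annotate)  # indices chosen for manual labeling
print(pseudo)    # index -> pseudo-label for confident samples
```

With these toy inputs, the pair at index 2 is uncertain but close to the source, so it is neither annotated nor pseudo-labeled; under the assumed scoring rule it simply stays unlabeled for this round.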
Pages: 13