Enhancing Deep Entity Resolution with Integrated Blocker-Matcher Training: Balancing Consensus and Discrepancy

Times Cited: 0
Authors
Dou, Wenzhou [1]
Shen, Derong [1]
Zhou, Xiangmin [2]
Bai, Hui [1]
Kou, Yue [1]
Nie, Tiezheng [1]
Cui, Hang [3]
Yu, Ge [1]
Affiliations
[1] Northeastern University, Shenyang, China
[2] RMIT University, Melbourne, Australia
[3] University of Illinois, Urbana, IL, USA
Source
Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM 2024), 2024
Funding
National Natural Science Foundation of China
Keywords
Entity resolution; Entity matching; Blocking; Data integration; Pre-trained language models; Large language models
DOI
10.1145/3627673.3679843
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Deep entity resolution (ER) identifies matching entities across data sources using deep learning techniques. It involves two steps: a blocker that identifies potential matches to generate candidate pairs, and a matcher that accurately distinguishes matches from non-matches among these candidates. Recent deep ER approaches use pre-trained language models (PLMs) to extract similarity features for blocking and matching, achieving state-of-the-art performance. However, they often fail to balance the consensus and the discrepancy between the blocker and matcher, emphasizing the former while neglecting the latter. This paper proposes MutualER, a deep entity resolution framework that integrates and jointly trains the blocker and matcher while balancing both the consensus and the discrepancy between them. Specifically, we first introduce a lightweight PLM in a siamese structure for the blocker, and a heavier PLM in a cross structure or an autoregressive large language model (LLM) for the matcher. Two optimization techniques, Mutual Sample Selection (MSS) and Similarity Knowledge Transferring (SKT), are designed to jointly train the blocker and matcher. MSS enables the blocker and matcher to select customized training samples for each other, maintaining the discrepancy, while SKT lets them share similarity knowledge to improve their respective blocking and matching capabilities, maintaining the consensus. Extensive experiments on five datasets demonstrate that MutualER significantly outperforms existing PLM-based and LLM-based approaches, leading in both effectiveness and efficiency.
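
To make the two-step pipeline concrete, below is a minimal Python sketch of a siamese (bi-encoder) blocker feeding a cross-encoder matcher, in the spirit of the architecture the abstract describes. The backbone names (distilbert-base-uncased, bert-base-uncased), the cosine threshold, and the helpers encode, block, and match are illustrative assumptions, not the authors' implementation; the joint-training techniques MSS and SKT are omitted entirely.

import torch
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

# Blocker: a lightweight PLM in a siamese structure (assumed backbone).
blocker_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
blocker = AutoModel.from_pretrained("distilbert-base-uncased")

def encode(record: str) -> torch.Tensor:
    """Embed one record independently (one siamese side of the blocker)."""
    inputs = blocker_tok(record, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = blocker(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)              # mean-pooled embedding

def block(left: str, right: str, threshold: float = 0.8) -> bool:
    """Cheap candidate generation: keep the pair if embeddings are close.
    The threshold value is an illustrative assumption."""
    sim = torch.cosine_similarity(encode(left), encode(right), dim=0)
    return sim.item() >= threshold

# Matcher: a heavier PLM in a cross structure (assumed backbone); the paper
# alternatively uses an autoregressive LLM here.
matcher_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
matcher = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def match(left: str, right: str) -> bool:
    """Expensive pair classification: encode both records jointly."""
    inputs = matcher_tok(left, right, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = matcher(**inputs).logits  # (1, 2): [non-match, match]
    return logits.argmax(dim=-1).item() == 1

# Block first, then run the heavier matcher only on surviving candidates.
a = "iPhone 14 Pro, 128 GB, black"
b = "Apple iPhone 14 Pro 128GB (Black)"
if block(a, b):
    print("match" if match(a, b) else "non-match")

The design point this illustrates is the cost asymmetry between the two steps: the bi-encoder embeds each record once, so blocking scales roughly linearly in the number of records, while the cross encoder (or an LLM) must see each pair jointly and is therefore reserved for the candidates that survive blocking.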
Pages: 508-518 (11 pages)