Large Language Model-Based Representation Learning for Entity Resolution using Contrastive Learning

Times Cited: 0
Authors
Foua, Bi T. [1 ]
Talburt, John R. [2 ]
Xu, Xiaowei [2 ]
Affiliations
[1] Univ Arkansas Little Rock, Dept Comp Sci, Little Rock, AR 72204 USA
[2] Univ Arkansas Little Rock, Informat Sci Dept, Little Rock, AR 72204 USA
Source
2023 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND COMPUTATIONAL INTELLIGENCE, CSCI 2023 | 2023
Funding
U.S. National Science Foundation;
Keywords
Representation Learning; Large Language Models; Entity Resolution; Contrastive Learning; Entity Matching;
DOI
10.1109/CSCI62032.2023.00010
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we introduce TriBERTa, a supervised entity resolution (ER) system that utilizes a pre-trained large language model and a triplet loss function. TriBERTa is engineered to learn representations that inherently group similar entities together while separating dissimilar ones, forming a foundational asset that can be leveraged across all steps of the ER process, including entity matching, data blocking, and data resolution. Our approach employs a two-step process: first, name entity records are fed into a Sentence Bidirectional Encoder Representations from Transformers (SBERT) model, a pre-trained language model built on the BERT architecture. The language model generates vector representations, which are then fine-tuned using contrastive learning based on a triplet loss function. In the second step, the fine-tuned representations are used as input for entity resolution tasks, in our case entity matching. Our results show that the proposed approach outperforms state-of-the-art representations, including SBERT without fine-tuning and conventional Term Frequency-Inverse Document Frequency (TF-IDF), by a margin of 3% to 19%. Additionally, in comparison to specialized end-to-end entity matching cross-encoder models, the representations generated by TriBERTa demonstrate increased robustness, maintaining consistently higher performance across a range of datasets.
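A minimal sketch of the two-step pipeline described in the abstract, using the sentence-transformers library: fine-tune an SBERT encoder with a triplet loss so that records of the same entity embed close together, then compare records by cosine similarity for entity matching. The model name, the toy triplet records, and the training settings are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of the described approach (not the authors' released code).
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

# Step 1: load a pre-trained SBERT model (BERT-based sentence encoder).
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

# Toy triplets: (anchor record, positive = same entity, negative = different entity).
train_examples = [
    InputExample(texts=[
        "John R. Talburt, UALR, Little Rock AR",      # anchor
        "Talburt, John, Univ Arkansas Little Rock",   # positive
        "John T. Albert, University of Texas",        # negative
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Contrastive fine-tuning with a triplet loss: pull anchor/positive together,
# push anchor/negative apart by at least the margin.
train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

# Step 2: use the fine-tuned representations for entity matching via cosine similarity.
anchor = model.encode("John R. Talburt, UALR, Little Rock AR", convert_to_tensor=True)
candidates = model.encode(
    ["Talburt, John, Univ Arkansas Little Rock", "John T. Albert, University of Texas"],
    convert_to_tensor=True,
)
scores = util.cos_sim(anchor, candidates)  # higher score => more likely the same entity
print(scores)
```

In practice, a similarity threshold (or a downstream classifier) on these scores decides which record pairs are declared matches; the threshold value would be tuned on a labeled validation set.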
Pages: 15-22
Page count: 8