Large Language Model-Based Representation Learning for Entity Resolution using Contrastive Learning

Times Cited: 0
Authors
Foua, Bi T. [1 ]
Talburt, John R. [2 ]
Xu, Xiaowei [2 ]
Affiliations
[1] Univ Arkansas Little Rock, Dept Comp Sci, Little Rock, AR 72204 USA
[2] Univ Arkansas Little Rock, Informat Sci Dept, Little Rock, AR 72204 USA
Source
2023 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND COMPUTATIONAL INTELLIGENCE, CSCI 2023 | 2023
Funding
U.S. National Science Foundation;
Keywords
Representation Learning; Large Language Models; Entity Resolution; Contrastive Learning; Entity Matching;
DOI
10.1109/CSCI62032.2023.00010
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we introduce TriBERTa, a supervised entity resolution (ER) system that utilizes a pre-trained large language model and a triplet loss function. TriBERTa is engineered to learn representations that inherently group similar entities together while separating dissimilar ones, forming a foundational asset that can be leveraged across all steps of the ER process, including entity matching, data blocking, and data resolution. Our approach employs a two-step process: first, name entity records are fed into a Sentence Bidirectional Encoder Representations from Transformers (SBERT) model, a pre-trained language model built on the BERT architecture. The language model generates vector representations, which are then fine-tuned using contrastive learning based on a triplet loss function. In the second step, the fine-tuned representations are used as input for entity resolution tasks, in our case entity matching. Our results show that the proposed approach outperforms state-of-the-art representations, including SBERT without fine-tuning and conventional Term Frequency-Inverse Document Frequency (TF-IDF), by a margin of 3% to 19%. Additionally, in comparison to specialized end-to-end entity matching cross-encoder models, the representations generated by TriBERTa demonstrate increased robustness, maintaining consistently higher performance across a range of datasets.
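A minimal sketch of the two-step pipeline described in the abstract, using the sentence-transformers library: fine-tune an SBERT encoder with a triplet loss so that records of the same entity embed close together, then compare records by cosine similarity for entity matching. The model name, the toy triplet records, and the training settings are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of the described approach (not the authors' released code).
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

# Step 1: load a pre-trained SBERT model (BERT-based sentence encoder).
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

# Toy triplets: (anchor record, positive = same entity, negative = different entity).
train_examples = [
    InputExample(texts=[
        "John R. Talburt, UALR, Little Rock AR",      # anchor
        "Talburt, John, Univ Arkansas Little Rock",   # positive
        "John T. Albert, University of Texas",        # negative
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Contrastive fine-tuning with a triplet loss: pull anchor/positive together,
# push anchor/negative apart by at least the margin.
train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

# Step 2: use the fine-tuned representations for entity matching via cosine similarity.
anchor = model.encode("John R. Talburt, UALR, Little Rock AR", convert_to_tensor=True)
candidates = model.encode(
    ["Talburt, John, Univ Arkansas Little Rock", "John T. Albert, University of Texas"],
    convert_to_tensor=True,
)
scores = util.cos_sim(anchor, candidates)  # higher score => more likely the same entity
print(scores)
```

In practice, a similarity threshold (or a downstream classifier) on these scores decides which record pairs are declared matches; the threshold value would be tuned on a labeled validation set.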
Pages: 15-22
Page count: 8