QA-Matcher: Unsupervised Entity Matching Using a Question Answering Model

Cited by: 0
Authors
Hayashi, Shogo [1, 3]
Dong, Yuyang [2 ]
Oyamada, Masafumi [2 ]
Affiliations
[1] BizReach Inc, Tokyo, Japan
[2] NEC Corp Ltd, Tokyo, Japan
[3] NEC Corp Ltd, Tokyo, Japan
Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2023, PT IV | 2023 / Vol. 13938
Keywords
entity matching; question answering
DOI
10.1007/978-3-031-33383-5_14
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Entity matching (EM) is a fundamental task in data integration, which involves identifying records that refer to the same real-world entity. Unsupervised EM is often preferred in real-world applications, as labeling data is often a labor-intensive process. However, existing unsupervised methods may not always perform well because the assumptions for these methods may not hold for tasks in different domains. In this paper, we propose QA-Matcher, an unsupervised EM model that is domain-agnostic and doesn't require any particular assumptions. Our idea is to frame EM as question answering (QA) by utilizing a trained QA model. Specifically, we generate a question that asks which record has the characteristics of a particular record and a passage that describes other records. We then use the trained QA model to predict the record pair that corresponds to the question-answer as a match. QA-Matcher leverages the power of a QA model to represent the semantics of various types of entities, allowing it to identify identical entities in a QA-like fashion. In extensive experiments on 16 real-world datasets, we demonstrate that QA-Matcher outperforms unsupervised EM methods and is competitive with supervised methods.
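The framing described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the "attribute is value" serialization, the question and passage templates, and in particular `toy_qa`, a token-overlap stand-in that only mimics the interface of a trained extractive QA model (question and passage in, answer span out). The sketch always returns some candidate; the abstract does not say how non-matches are handled, so that case is left out.

```python
import re
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
# Hypothetical interface: (question, passage) -> (start, end) character span of the answer.
QAModel = Callable[[str, str], Tuple[int, int]]


def serialize(record: Record) -> str:
    """Flatten a record into 'attribute is value' text (assumed template)."""
    return ", ".join(f"{key} is {value}" for key, value in record.items())


def build_question(query: Record) -> str:
    """Ask which record has the characteristics of the query record."""
    return f"Which record has the following characteristics: {serialize(query)}?"


def build_passage(candidates: List[Record]) -> Tuple[str, List[Tuple[int, int]]]:
    """Describe the other records in one passage; also return each record's character span."""
    parts, spans, pos = [], [], 0
    for i, rec in enumerate(candidates):
        sentence = f"Record {i}: {serialize(rec)}."
        spans.append((pos, pos + len(sentence)))
        parts.append(sentence)
        pos += len(sentence) + 1  # +1 for the joining space
    return " ".join(parts), spans


def match(query: Record, candidates: List[Record], qa_model: QAModel) -> int:
    """Return the index of the candidate whose description overlaps the answer span most."""
    question = build_question(query)
    passage, spans = build_passage(candidates)
    start, end = qa_model(question, passage)
    overlaps = [max(0, min(end, e) - max(start, s)) for s, e in spans]
    return max(range(len(candidates)), key=overlaps.__getitem__)


def toy_qa(question: str, passage: str) -> Tuple[int, int]:
    """Stand-in for a trained QA model: pick the record sentence sharing most tokens."""
    q_tokens = set(re.findall(r"\w+", question.lower()))
    sentences = list(re.finditer(r"Record \d+:.*?(?= Record \d+:|$)", passage))
    best = max(
        sentences,
        key=lambda m: len(q_tokens & set(re.findall(r"\w+", m.group().lower()))),
    )
    return best.start(), best.end()


# Usage with made-up records.
query = {"title": "iPhone 13 Pro 128GB", "brand": "Apple"}
candidates = [
    {"title": "Galaxy S21 128GB", "brand": "Samsung"},
    {"title": "Apple iPhone 13 Pro (128 GB)", "brand": "Apple"},
    {"title": "Pixel 6", "brand": "Google"},
]
print(match(query, candidates, toy_qa))  # index of the predicted matching candidate
```

In practice, `toy_qa` would be replaced by a call into a trained QA model that returns an answer span over the passage; the rest of the sketch only shows how a record pair is read off from the question and the answer.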
Pages: 174-185
Page count: 12