Deep Entity Matching: Challenges and Opportunities

Cited by: 19
Authors
Li, Yuliang [1]
Li, Jinfeng [1]
Suhara, Yoshihiko [1]
Wang, Jin [1]
Hirota, Wataru [1]
Tan, Wang-Chiew [1]
Affiliation
[1] Megagon Labs, 444 Castro St Suite 720, Mountain View, CA 94041 USA
Source
ACM JOURNAL OF DATA AND INFORMATION QUALITY | 2021, Vol. 13, No. 1
Keywords
Entity matching; entity resolution; data integration; deep learning; pretrained language models; AI
DOI
10.1145/3431816
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Entity matching refers to the task of determining whether two different representations refer to the same real-world entity. It continues to be a prevalent problem for many organizations where data resides in different sources and duplicates need to be identified and managed. The term "entity matching" also loosely refers to the broader problem of determining whether two heterogeneous representations of different entities should be associated together. This problem has an even wider scope of applications, from determining the subsidiaries of companies to matching jobs to job seekers, which has impactful consequences. In this article, we first report our recent system DITTO, which is an example of a modern entity matching system based on pretrained language models. Then we summarize recent solutions in applying deep learning and pretrained language models for solving the entity matching task. Finally, we discuss research directions beyond entity matching, including the promise of synergistically integrating the blocking and entity matching steps, the need to examine methods to alleviate the steep training data requirements that are typical of deep learning or pretrained language models, and the importance of generalizing entity matching solutions to handle the broader entity matching problem, which leads to an even more pressing need to explain matching outcomes.
Pages: 17