Schema-Agnostic Entity Matching using Pre-trained Language Models

Cited by: 10
Authors
Teong, Kai-Sheng [1]
Soon, Lay-Ki [1]
Su, Tin Tin [2]
Affiliations
[1] Monash Univ Malaysia, Sch Informat Technol, Subang Jaya, Malaysia
[2] Monash Univ Malaysia, Jeffrey Cheah Sch Med & Hlth Sci, Subang Jaya, Malaysia
Source
CIKM '20: PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT | 2020
Keywords
Schema Agnostic; Entity Matching; Language Models;
DOI
10.1145/3340531.3412131
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Entity matching (EM) is the process of linking records from different data sources. While extensive research has been done on various aspects of EM, most studies treat EM tasks as schema-specific and attempt to match record pairs at the attribute level. Unfortunately, in the real world, tables that undergo EM may not have an aligned schema, and the schema or metadata of the tables and attributes is often not known beforehand. To address this challenge, this paper presents an effective approach for schema-agnostic EM, in which schema-aligned tables are not required. The proposed method stems from the idea of treating the tuples to be matched like a sentence pair classification problem in natural language processing (NLP). A pre-trained language model, BERT, is adopted and fine-tuned on a labeled dataset. The proposed method was evaluated on benchmark datasets and compared against two state-of-the-art approaches, namely DeepMatcher and Magellan. The experimental results show that the proposed solution outperforms them by an average of 9% in F1 score. The performance is consistent across different types of datasets, with a significant improvement of 29.6% on one of the dirty datasets. These results show that the proposed solution is versatile for EM.
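The idea of treating two tuples as a sentence pair can be illustrated with a minimal sketch. The abstract does not say how tuples are turned into text, so the serialization below (joining attribute values and ignoring attribute names) is an assumption, and the function names `serialize_record` and `bert_style_input` are hypothetical. The `[CLS]`/`[SEP]` string only shows the shape of a BERT sentence-pair input; in practice a tokenizer would add these markers, and the fine-tuning step is not shown.

```python
from typing import Mapping, Optional


def serialize_record(record: Mapping[str, Optional[object]]) -> str:
    """Flatten a record into text using attribute values only.

    Attribute names are ignored, so the two tables do not need an
    aligned schema (assumed serialization, not taken from the paper).
    """
    values = (str(v).strip() for v in record.values() if v is not None)
    return " ".join(v for v in values if v)


def bert_style_input(left: Mapping[str, Optional[object]],
                     right: Mapping[str, Optional[object]]) -> str:
    """Build an illustrative sentence-pair string for a match/non-match classifier."""
    return f"[CLS] {serialize_record(left)} [SEP] {serialize_record(right)} [SEP]"


# Two records with different, unaligned schemas (made-up example data).
left = {"title": "iPhone 11 64GB", "brand": "Apple"}
right = {"name": "Apple iPhone 11", "storage": "64GB", "price": None}

pair_text = bert_style_input(left, right)
print(pair_text)
```

A fine-tuned sentence-pair classifier would then take such a pair as input and predict whether the two records refer to the same entity.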
Pages: 2241-2244
Page count: 4