Entity Matching on Unstructured Data: An Active Learning Approach

Cited by: 9
Authors
Brunner, Ursin [1]
Stockinger, Kurt [1]
Affiliations
[1] ZHAW Zurich Univ Appl Sci, Zurich, Switzerland
Source
2019 6th Swiss Conference on Data Science (SDS) | 2019
DOI
10.1109/SDS.2019.00006
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the growing number of data sources in enterprises, entity matching has become a crucial part of every data integration project. To reduce the human effort involved in identifying matching entities between different database tables, machine learning algorithms are typically applied. Moreover, active learning is often combined with supervised machine learning methods to further reduce the effort of labeling entity pairs as true or false matches. However, while state-of-the-art active learning algorithms have proven to work well on structured data sets, unstructured data still poses a challenge in entity matching. This paper proposes an end-to-end entity matching pipeline that minimizes the human labeling effort for entity matching on unstructured data sets. We use several natural language processing techniques, such as soft tf-idf, to pre-process the record pairs before classifying them with a novel Active Learning with Uncertainty Sampling (ALWUS) algorithm. We designed our algorithm as a plugin system that works with any state-of-the-art classifier, such as support vector machines, random forests, or deep neural networks. Detailed experimental results demonstrate that our end-to-end entity matching pipeline clearly outperforms comparable entity matching approaches on an unstructured real-world data set. Our approach achieves significantly better F1-scores while requiring one to two orders of magnitude less human labeling effort than existing state-of-the-art algorithms.
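The abstract does not spell out the ALWUS algorithm, so the following is only a minimal sketch of generic pool-based active learning with uncertainty sampling, the family of techniques the abstract names. It is not the authors' implementation. All names (`most_uncertain`, `active_learning_loop`, `train_threshold`) are hypothetical. The toy one-dimensional threshold classifier stands in for whatever classifier would be plugged in, and each record pair is assumed to be already reduced to a single similarity score in [0, 1].

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Model = Callable[[float], float]  # maps a similarity score to P(match)


def most_uncertain(probs: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k probabilities closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


def train_threshold(examples: Sequence[Tuple[float, bool]]) -> Model:
    """Toy stand-in classifier (assumption): a sigmoid around a learned threshold."""
    pos = [x for x, y in examples if y]
    neg = [x for x, y in examples if not y]
    # Place the threshold midway between the classes; fall back to 0.5
    # when only one class has been labeled so far.
    t = (min(pos) + max(neg)) / 2 if pos and neg else 0.5
    return lambda x: 1.0 / (1.0 + math.exp(-20.0 * (x - t)))


def active_learning_loop(
    pool: Sequence[float],
    oracle: Callable[[float], bool],
    train: Callable[[Sequence[Tuple[float, bool]]], Model],
    seed_idx: Sequence[int],
    batch_size: int,
    rounds: int,
) -> Tuple[Model, Dict[int, bool]]:
    """Repeatedly ask the oracle to label the pairs the model is least sure about."""
    labeled: Dict[int, bool] = {i: oracle(pool[i]) for i in seed_idx}
    model = train([(pool[i], y) for i, y in labeled.items()])
    for _ in range(rounds):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        probs = [model(pool[i]) for i in unlabeled]
        for j in most_uncertain(probs, batch_size):
            i = unlabeled[j]
            labeled[i] = oracle(pool[i])  # one unit of human labeling effort
        # Retrain on everything labeled so far.
        model = train([(pool[i], y) for i, y in labeled.items()])
    return model, labeled
```

Because `train` is passed in as a parameter, any classifier that returns a match probability could be substituted for the toy threshold model, which mirrors the plugin idea described in the abstract.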
Pages: 97-102
Number of pages: 6