ERBlox: Combining matching dependencies with machine learning for entity resolution

Cited by: 19
Authors
Bahmani, Zeinab [1]
Bertossi, Leopoldo [1]
Vasiloglou, Nikolaos [2]
Affiliations
[1] Carleton Univ, Sch Comp Sci, Ottawa, ON, Canada
[2] LogicBlox Inc, Atlanta, GA 30309 USA
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
Entity resolution; Matching dependencies; Support-vector machines; Classification; Datalog; Blocking techniques; Linkage
DOI
10.1016/j.ijar.2017.01.003
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Entity resolution (ER), an important and common data cleaning problem, is about detecting duplicate data representations of the same external entities and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating four components of ER: (a) building a classifier for "duplicate/non-duplicate" record pairs using machine learning (ML) techniques; (b) use of MDs for supporting the blocking phase of ML; (c) record merging on the basis of the classifier results; and (d) the use of the declarative language LogiQL, an extended form of Datalog supported by the LogicBlox platform, for all activities related to data processing, and the specification and enforcement of MDs. © 2017 Elsevier Inc. All rights reserved.
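The three-stage flow the abstract describes (blocking, pairwise classification, grouping of duplicates) can be illustrated with a minimal Python sketch. This is not the authors' LogiQL implementation: the toy records, the city-based `block_key`, and the name-similarity threshold in `is_duplicate` are all assumptions made for illustration. In particular, the threshold rule merely stands in for the trained SVM classifier, and the fixed blocking key stands in for MD-driven blocking.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; none of this data comes from the paper.
RECORDS = [
    {"id": 1, "name": "Jane Doe", "city": "Ottawa"},
    {"id": 2, "name": "Jane  Doe", "city": "Ottawa"},
    {"id": 3, "name": "John Smith", "city": "Atlanta"},
    {"id": 4, "name": "Jane Doe", "city": "Atlanta"},
]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_key(record):
    # Assumption: block on city. The paper instead uses MDs to support blocking.
    return record["city"].lower()


def candidate_pairs(records):
    """Yield only pairs of records that share a block."""
    blocks = {}
    for record in records:
        blocks.setdefault(block_key(record), []).append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


def is_duplicate(r1, r2, threshold=0.9):
    # Assumption: a similarity threshold stands in for the trained classifier.
    return similarity(r1["name"], r2["name"]) >= threshold


def resolve(records):
    """Group record ids into clusters of predicted duplicates (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for r1, r2 in candidate_pairs(records):
        if is_duplicate(r1, r2):
            parent[find(r2["id"])] = find(r1["id"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(clusters.values())
```

Calling `resolve(RECORDS)` groups records 1 and 2 together, while record 4 is never compared with them because it falls in a different block. The sketch stops at clustering; producing a single merged representation per cluster, as the abstract describes, would be a further step.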
Pages: 118-141
Number of pages: 24