An effective weighted rule-based method for entity resolution

Cited by: 9
Authors
Abu Ahmad, Hiba [1]
Wang, Hongzhi [1]
Affiliations
[1] Harbin Inst Technol, Dept Comp Sci, Harbin, Heilongjiang, Peoples R China
Keywords
Digital libraries; Entity resolution; Data cleaning; Rule learning; Identification
DOI
10.1007/s10619-018-7240-6
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Entity resolution is an important data cleaning task that detects records belonging to the same entity. It has a critical impact on digital libraries, where different entities can share the same name without any identifying key. Conventional methods use similarity measures and clustering techniques to find the records of a specific entity. Because these methods perform poorly, recent methods instead build rules on record attributes whose values are distinct to an entity. However, these rule-based methods use too few attributes and ignore common and empty attribute values, which lowers the quality of entity resolution. In this paper, we define a multi-attribute weighted rule system (MAWR) that examines all values of the records' attributes in order to represent the difficult record-entity mapping. We then propose a rule generation algorithm based on this system, and an entity resolution algorithm (MAWR-ER) that uses the generated rules to identify entities. We evaluate our method on real data, and the experimental results demonstrate its effectiveness and efficiency.
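The abstract does not give the rule format or the weighting scheme, so the following Python fragment is only a minimal sketch of the general idea of weighted attribute rules, not the MAWR or MAWR-ER algorithms themselves. The rule table, the weights, the threshold, the entity labels, and the function name resolve are all hypothetical. Each rule maps an (attribute, value) pair to weighted votes for one or more entities; a value shared by several entities simply carries a lower weight, and an empty value contributes no evidence.

```python
from collections import defaultdict

# Hypothetical rule table: (attribute, value) -> {entity: weight}.
# A value shared by several entities (a "common" value) votes for each of them.
RULES = {
    ("coauthor", "A. Smith"): {"e1": 0.6},
    ("coauthor", "B. Jones"): {"e2": 0.7},
    ("venue", "Conf X"): {"e1": 0.3, "e2": 0.3},
}


def resolve(record, rules, threshold=0.5):
    """Return the entity with the highest summed rule weight, or None.

    record: dict mapping attribute name -> value (an empty value is skipped).
    threshold: assumed minimum total weight needed to accept a match.
    """
    scores = defaultdict(float)
    for attribute, value in record.items():
        if not value:  # empty attribute value: no evidence either way
            continue
        for entity, weight in rules.get((attribute, value), {}).items():
            scores[entity] += weight
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# A distinct coauthor plus a common venue is enough to pick e1.
print(resolve({"coauthor": "A. Smith", "venue": "Conf X"}, RULES))
# A common venue alone stays below the threshold, so no entity is assigned.
print(resolve({"coauthor": "", "venue": "Conf X"}, RULES))
```

In this sketch a record is left unresolved when its total evidence is weak, which is one possible way to handle records that carry only common or empty values.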
Pages: 593-612
Page count: 20