Entity resolution framework using rough set blocking for heterogeneous web of data

Cited by: 5
Authors
Vidhya, K. A. [1]
Geetha, T. V. [1]
Affiliations
[1] Anna Univ, Dept Comp Sci, Madras, Tamil Nadu, India
Keywords
Entity resolution; blocking; rough set; heterogeneous data; linked open data; record linkage; deduplication
DOI
10.3233/JIFS-17946
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Entity Resolution (ER) is the task of identifying records that refer to the same real-world entity, and it is used in data cleaning and data integration. Existing ER frameworks, however, lead to exhaustive pair-wise comparisons. Blocking, the most efficient ER method, still requires an exponential number of pair-wise comparisons on large databases, which makes resolving the entities inefficient. Real-world data can be homogeneous or heterogeneous, and ER generally takes one of two forms: clean-clean ER, in which the datasets contain no duplicates, and dirty ER, in which duplicates occur within the dataset. An ER framework has two phases. The block building phase constructs the blocks, grouping similar entities into a single block for effective indexing; the block processing phase aims to reduce the number of redundant pair-wise comparisons. A further concern is handling entities drawn from heterogeneous data. In the proposed work, the block building phase gathers related entities with different representations into a single block with an approximation space. For this purpose, a semantic-dominance rough set is used to cluster the attributes of related entities that have varied schemas. The similarity between the entities associated with the clustered attributes is determined using a rough-Jaccard similarity measure, and the entities are grouped to form blocks of varied but limited size. Pair-wise comparisons between blocks of entities are carried out only when the lower approximations of the blocks are the same, as determined by the proposed multi-criteria Pareto optimality; otherwise the entities are not compared, which reduces the overall number of pair-wise comparisons.
The proposed technique was evaluated on four real-world, highly heterogeneous datasets, and validation yielded 99.98% effectiveness and 98.3% efficiency in block comparison when compared with token blocking and attribute clustering methods.
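To make the comparison-saving idea in the abstract concrete, the following is a minimal sketch of schema-agnostic token blocking, one of the baselines the abstract names. It is not the authors' method: it uses the plain Jaccard coefficient rather than the rough-Jaccard measure, and it leaves out the semantic-dominance rough set and the Pareto-based block filtering. The toy entities, the function names, and the 0.5 threshold are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy entities with differing schemas (not taken from the paper's datasets).
entities = {
    "e1": {"name": "Anna University", "city": "Chennai"},
    "e2": {"label": "anna university chennai"},
    "e3": {"title": "Entity Resolution Survey", "year": "2007"},
    "e4": {"name": "survey of entity resolution"},
}


def tokens(entity):
    """Schema-agnostic token set: lowercase words from every attribute value."""
    return {tok for value in entity.values() for tok in value.lower().split()}


def jaccard(a, b):
    """Plain Jaccard coefficient |a & b| / |a | b| (a stand-in for rough-Jaccard)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def token_blocks(entities):
    """Build one block per token; keep only blocks holding at least two entities."""
    blocks = {}
    for eid, entity in entities.items():
        for tok in tokens(entity):
            blocks.setdefault(tok, set()).add(eid)
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Distinct entity pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Exhaustive comparison versus comparison restricted to shared blocks.
all_pairs = set(combinations(sorted(entities), 2))
blocks = token_blocks(entities)
candidates = candidate_pairs(blocks)

# Assumed similarity threshold of 0.5, chosen only for this illustration.
matches = {
    (a, b)
    for a, b in candidates
    if jaccard(tokens(entities[a]), tokens(entities[b])) >= 0.5
}

print(len(all_pairs), "exhaustive pairs;", len(candidates), "pairs after blocking")
print(sorted(matches))
```

On this toy input, blocking restricts the comparisons to pairs that share at least one token, so entities with no tokens in common are never compared. The abstract's framework goes further by comparing blocks only when their lower approximations coincide.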
Pages: 659-675
Number of pages: 17