Efficient entity resolution based on subgraph cohesion

被引:7
作者
Wang, Hongzhi [1 ]
Li, Jianzhong [1 ]
Gao, Hong [1 ]
机构
[1] Harbin Inst Technol, Dept Comp Sci & Technol, Harbin 150006, Peoples R China
关键词
Entity resolution; Data quality; Graph cohesion;
D O I
10.1007/s10115-015-0818-7
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Entity resolution has wide applications and receives considerable attentions in literature. For entity resolution, similarity functions are often used to judge whether two data objects refer to the same real-world entity. However, the similar relations determined by many commonly used similarity functions lack transitivity. This fact results in the conflict that and refer to the same entity and and refer to the same entity, but and do not refer to the same entity. To address this problem and make the group-wise entity resolution results consistent with pairwise entity resolution, this paper models the entity resolution problem as the partition of the vertices in a weighted graph into cohesive subgraphs, which is proven to be co-NP-complete. To solve this problem, an approximate algorithm with approximation ratio bound is proposed. For performing entity resolution on a large data set efficiently, a heuristic algorithm is developed to address this problem. In order to implement the heuristic algorithm efficiently, a similarity measure compatible with many measures in common usage is presented. With such similarity measure, indices and efficient implementations for the heuristic algorithm are proposed. Extensive experiments have been performed to verify the efficiency and effectiveness of the methods in this paper.
引用
收藏
页码:285 / 314
页数:30
相关论文
共 31 条
  • [1] [Anonymous], 2004, Proc. of the ACM SIGMOD Workshop on Data Mining and Knowledge Discovery, DOI DOI 10.1145/1008694.1008697
  • [2] [Anonymous], 2005, AAAI
  • [3] [Anonymous], AAAI
  • [4] [Anonymous], 1978, P 10 ANN ACM S THEOR, DOI DOI 10.1145/800133.804355
  • [5] [Anonymous], 2007, VLDB
  • [6] [Anonymous], 2007, IEEE 23 INT C DATA E
  • [7] [Anonymous], 2002, P 8 ACM SIGKDD INT C, DOI DOI 10.1145/775047.775087
  • [8] Large-Scale Deduplication with Constraints using Dedupalog
    Arasu, Arvind
    Re, Christopher
    Suciu, Dan
    [J]. ICDE: 2009 IEEE 25TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2009, : 952 - 963
  • [9] Aslam J.A., 2004, J GRAPH ALGORITHMS A, V8, P95, DOI [10.7155/jgaa.00084, DOI 10.7155/JGAA.00084]
  • [10] Bayardo R.J., 2007, WWW, P131, DOI [DOI 10.1145/1242572.1242591, 10.1145/1242572.1242591]