Entity resolution for media metadata based on structural clustering

Times cited: 0
Authors
Gu, Qi [1, 2]
Cao, Jian [1]
Liu, Yancen [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Sch Elect Informat & Elect Engn, 800 Dongchuan Rd, Shanghai 200240, Peoples R China
[2] Nantong Univ, Sch Informat Sci & Technol, Dept Comp Sci, 9 Seyuan Rd, Nantong 226019, Peoples R China
Keywords
Entity resolution; Structural clustering; Iterative propagation; Graph structure
DOI
10.1007/s11042-019-08062-6
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
An increasing amount of media metadata is published on the Web by different organizations, which leads to a fragmented dataset landscape. Identifying media metadata across disparate datasets and integrating heterogeneous datasets have many applications but also pose significant challenges. To tackle this problem, entity resolution methods are commonly used as an essential prerequisite for integrating media information from different sources and for fostering the re-use of existing data sources. As the amount of media metadata published on the Web grows steadily, scaling entity resolution to large media knowledge bases while maintaining high matching quality is a critical challenge. This article investigates the relationships between media entities. To that end, the media database is formulated as a knowledge graph with entities as nodes and the associations between related entities as edges, so that media entities can be grouped into communities according to the neighbors they share. A structural clustering-based model is then proposed to detect communities and to discover anchor vertices as well as isolated vertices. Specifically, an initial seed set of matched anchor vertex pairs is obtained. Furthermore, an iterative propagation approach is developed for identifying the matched entities in the whole graph, in which community similarity is introduced into the measure function to control the total measurement of candidate pairs. Starting with the elements of the initial seed set, the entity resolution algorithm thus iteratively updates the matching information over the whole network along the neighbor relationships. Extensive experiments are conducted on real datasets to evaluate how the seed set impacts the matching process and performance. The experimental results show that this model achieves an excellent balance between accuracy and efficiency and is a clear improvement over state-of-the-art methods.
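The seed-and-propagate idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the measure function, so the score below (a weighted sum of a neighbor-overlap term and a community-agreement term) is an assumption. The function name `propagate_matches`, the parameters `alpha` and `threshold`, and the toy graphs are likewise hypothetical, and the community labels are taken as given rather than computed by structural clustering.

```python
from collections import defaultdict


def propagate_matches(g1, g2, seeds, comm1, comm2, alpha=0.5, threshold=0.4):
    """Greedy seed-based match propagation between two graphs (illustrative).

    g1, g2       : dict mapping node -> set of neighbour nodes
    seeds        : dict mapping g1 node -> g2 node (given anchor pairs)
    comm1, comm2 : dict mapping node -> community label (assumed precomputed)
    alpha        : hypothetical weight of the structural term
    threshold    : hypothetical minimum score for accepting a pair
    """
    matched = dict(seeds)
    used = set(matched.values())
    while True:
        # How often each community pair co-occurs among current matches.
        comm_pairs = defaultdict(int)
        comm_sizes = defaultdict(int)
        for a, b in matched.items():
            comm_pairs[(comm1[a], comm2[b])] += 1
            comm_sizes[comm1[a]] += 1

        # Candidates are unmatched neighbours of already matched pairs;
        # support counts the matched pairs adjacent to each candidate.
        support = defaultdict(int)
        for a, b in matched.items():
            for u in g1[a]:
                if u in matched:
                    continue
                for v in g2[b]:
                    if v not in used:
                        support[(u, v)] += 1

        best, best_score = None, threshold
        for (u, v), s in sorted(support.items()):  # sorted for determinism
            structural = s / max(len(g1[u]), len(g2[v]))
            size = comm_sizes[comm1[u]]
            community = comm_pairs[(comm1[u], comm2[v])] / size if size else 0.0
            score = alpha * structural + (1 - alpha) * community
            if score > best_score:
                best, best_score = (u, v), score

        if best is None:  # no candidate clears the threshold: stop
            return matched
        matched[best[0]] = best[1]
        used.add(best[1])


# Toy data: a hub with two leaves in each graph; the leaves are structurally
# indistinguishable, so only the community term separates x from y.
g1 = {"h1": {"x1", "y1"}, "x1": {"h1"}, "y1": {"h1"}}
g2 = {"h2": {"x2", "y2"}, "x2": {"h2"}, "y2": {"h2"}}
comm1 = {"h1": 0, "x1": 0, "y1": 1}
comm2 = {"h2": 0, "x2": 0, "y2": 1}
result = propagate_matches(g1, g2, {"h1": "h2"}, comm1, comm2)
print(result)
```

In this toy case the single seed pair (`h1`, `h2`) is enough to match `x1` to `x2` first, because they share the hub's community, after which `y1` and `y2` are the only remaining candidates. Accepting one pair per round keeps the sketch easy to follow but is slow on large graphs.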
Pages: 219-242
Page count: 24