Efficient link-based similarity search in web networks

被引：10

作者：

Zhang, Mingxi ^{[1
,2
]}

Hu, Hao ^{[2
]}

He, Zhenying ^{[2
]}

Gao, Liping ^{[1
,2
]}

Sun, Liujie ^{[1
]}

机构：

[1] Univ Shanghai Sci & Technol, Shanghai 200093, Peoples R China

[2] Fudan Univ, Shanghai 201203, Peoples R China

来源：

EXPERT SYSTEMS WITH APPLICATIONS | 2015年 / 42卷 / 22期

基金：

美国国家科学基金会;

关键词：

Similarity search; Web network; WebSim; STRUCTURAL SIMILARITY; COMPUTATION;

D O I：

10.1016/j.eswa.2015.07.042

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Similarity search in web networks, aiming to find entities similar to the given entity, is one of the core tasks in network analysis. With the proliferation of web applications, including web search and recommendation system, SimRank has been a well-known measure for evaluating entity similarity in a network. However, the existing work computes SimRank iteratively over a huge similarity matrix, which is expensive in terms of time and space cost and cannot efficiently support similarity search over large networks. In this paper, we propose a link-based similarity search method, WebSim, towards efficiently finding similar entities in web networks. WebSim defines the similarity between entities as the 2-hop similarity of SimRank. To reduce computation cost, we divide the similarity search process into two stages: off-line stage and on-line stage. In the off-line stage, the 1-hop similarities are computed, and an optimized algorithm is designed to reduce the unnecessary accumulation operations on zero similarities. In the on-line stage, the 2-hop similarities are computed, and a pruning algorithm is developed to support fast query processing through searching similar entries from a partial sums index derived from the 1-hop similarities. The index items that are lower than a given threshold are skipped to reduce the searching space. Compared to the iterative SimRank computation, the time and space cost of similarity computation is significantly reduced, since WebSim maintains only the similarity matrix of 1-hop that is much smaller than that of multi-hop. Experiments through comparison with SimRank and its optimized algorithms demonstrate that WebSim has on average a 99.83% reduction in the time cost and a 92.12% reduction in the space cost of similarity computation, and achieves on average 99.98% NDCG. (C) 2015 Elsevier Ltd. All rights reserved.

引用

页码：8868 / 8880

页数：13

共 43 条

[1] Ontology-based similarity for product information retrieval
Akmal, Suriati
Shih, Li-Hsing
Batres, Rafael
[J]. COMPUTERS IN INDUSTRY, 2014, 65 (01) : 91 - 107
[2] [Anonymous], 2002, P 8 ACM SIGKDD INT C
[3] [Anonymous], 2003, ACM SIGKDD Explor Newsl, DOI [DOI 10.1145/980972.980992, 10.1145/980972.980992]
[4] [Anonymous], 1972, TECHNICAL REPORT
[5] [Anonymous], 2006, P 32 INT C VER LARG
[6] A systematic review of scholar context-aware recommender systems
Champiri, Zohreh Dehghani
Shahamiri, Seyed Reza
Salim, Siti Salwah Binti
[J]. EXPERT SYSTEMS WITH APPLICATIONS, 2015, 42 (03) : 1743 - 1758
[7] Categorical data clustering: What similarity measure to recommend?
dos Santos, Tiago R. L.
Zarate, Luis E.
[J]. EXPERT SYSTEMS WITH APPLICATIONS, 2015, 42 (03) : 1247 - 1260
[8] Probabilistic SimRank computation over uncertain graphs
Du, Lingxia
Li, Cuiping
Chen, Hong
Tan, Liwen
Zhang, Yinglong
[J]. INFORMATION SCIENCES, 2015, 295 : 521 - 535
[9] Faloutsos M, 1999, COMP COMM R, V29, P251, DOI 10.1145/316194.316229
[10] Exploiting hierarchical domain structure to compute similarity
Ganesan, P
Garcia-Molina, H
Widom, J
[J]. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2003, 21 (01) : 64 - 93

← 1 2 3 4 5 →