HeteRank: A general similarity measure in heterogeneous information networks by integrating multi-type relationships

Cited by: 20
Authors
Zhang, Mingxi [1]
Wang, Jinhua [2]
Wang, Wei [2]
Affiliations
[1] Univ Shanghai Sci & Technol, Shanghai, Peoples R China
[2] Fudan Univ, Shanghai, Peoples R China
Funding
Natural Science Foundation of Shanghai;
Keywords
Similarity computation; HeteRank; Information network; SEARCH; RECOMMENDATION;
DOI
10.1016/j.ins.2018.04.022
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
As heterogeneous information networks become ubiquitous and increasingly complex, many data mining tasks have been explored on them, including clustering, collaborative filtering, and link prediction. Similarity computation is a fundamental task required by many data mining problems. Although a large number of similarity measures have been developed for assessing similarities in heterogeneous networks, they usually depend on the network schema and lack a general way of integrating the different kinds of relationships between objects. In this paper, we propose a similarity measure, HeteRank, for computing similarities in heterogeneous information networks in a general manner. The relationships between objects of different types are represented by a general relationship matrix (GRM) that is built based on the scales of the different object types. Based on the GRM, HeteRank fully integrates the multi-type relationships into similarity computation by utilizing all the meetings between objects. The HeteRank equation is further transformed into a simple binomial expression form that takes the restart probability into account. To compute HeteRank similarities efficiently, we divide the computation into two steps: the first computes the intermediate values, and the second computes the similarities from those intermediate values. We then approximate the HeteRank equation by setting thresholds for skipping low intermediate values and similarity scores. A pruning algorithm is developed to reduce the unnecessary visits, multiplications, and additions that contribute little to the similarity computation. Extensive experiments on real datasets demonstrate the effectiveness and efficiency of HeteRank in comparison with state-of-the-art similarity measures. (C) 2018 Elsevier Inc. All rights reserved.
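The abstract does not give the HeteRank equation, so the following is only a rough illustration of the two-step, threshold-pruned computation it describes. It uses a generic SimRank-style iteration with a restart term, S <- c * W^T S W + (1 - c) * I, which is an assumption standing in for the paper's method. The single column-normalized matrix `W` is likewise a simplified stand-in for the GRM, and the names `column_normalize`, `similarity`, `eps`, `delta`, and the toy author/paper/venue network are all hypothetical.

```python
from collections import defaultdict


def column_normalize(edges):
    """Return sparse W with W[u][v] = 1/deg(v) for every undirected edge (u, v)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    W = {node: {} for node in nbrs}
    for v, ns in nbrs.items():
        for u in ns:
            W[u][v] = 1.0 / len(ns)
    return W


def similarity(edges, c=0.8, iterations=10, eps=0.0, delta=0.0):
    """Iterate S <- c * W^T S W + (1 - c) * I in two steps, with pruning.

    Assumed stand-in for the paper's equation, not HeteRank itself.
    """
    W = column_normalize(edges)
    S = {node: {node: 1.0} for node in W}  # start from the identity
    for _ in range(iterations):
        # Step 1: intermediate values T = S W, skipping entries <= eps.
        T = {}
        for u, row in S.items():
            acc = defaultdict(float)
            for v, s in row.items():
                for b, w in W[v].items():
                    acc[b] += s * w
            T[u] = {b: t for b, t in acc.items() if t > eps}
        # Step 2: similarities from T, i.e. c * W^T T + (1 - c) * I.
        R = {node: defaultdict(float) for node in W}
        for u, t_row in T.items():
            for a, w in W[u].items():
                for b, t in t_row.items():
                    R[a][b] += w * t
        S = {}
        for a, r_row in R.items():
            # Off-diagonal scores <= delta are dropped.
            new_row = {b: c * r for b, r in r_row.items()
                       if a != b and c * r > delta}
            # The diagonal carries the restart term and is never pruned.
            new_row[a] = c * r_row.get(a, 0.0) + (1.0 - c)
            S[a] = new_row
    return S


# Toy network mixing three object types: authors (a*), papers (p*), a venue (v1).
EDGES = [("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("p1", "v1"), ("p2", "v1")]

exact = similarity(EDGES)
pruned = similarity(EDGES, eps=0.01, delta=0.01)
print("sim(a1, a2) without pruning:", exact["a1"]["a2"])
print("stored scores, exact vs pruned:",
      sum(len(r) for r in exact.values()),
      sum(len(r) for r in pruned.values()))
```

Raising `eps` and `delta` trades accuracy for fewer stored entries and fewer multiplications. The values used here are arbitrary and are not taken from the paper.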
Pages: 389-407
Page count: 19