Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Cited by: 18
Authors
Wang, Chenguang [1]
Song, Yangqiu [2]
Li, Haoran [3]
Zhang, Ming [3]
Han, Jiawei [4]
Affiliations
[1] Amazon AI, 2100 Univ Ave, East Palo Alto, CA USA
[2] HKUST, Dept CSE, Clear Water Bay, Hong Kong, Peoples R China
[3] Peking Univ, Sch EECS, Beijing, Peoples R China
[4] UIUC, Dept CS, Urbana, IL 61801 USA
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
Heterogeneous information network; Similarity; Text categorization; Classification
DOI
10.1007/s10618-018-0581-y
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A heterogeneous information network (HIN) is a general representation for many different applications, such as social networks, scholar networks, and knowledge networks. A key development for HINs is PathSim, a meta-path-based measure of the pairwise similarity between two entities of the same type in an HIN. Using PathSim in practice usually requires handcrafting meta-paths, which are paths over entity types rather than over the entities themselves. However, finding useful meta-paths is not trivial for humans. In this paper, we present an unsupervised meta-path selection approach that automatically finds useful meta-paths over an HIN, and then develop a new similarity measure called KnowSim, which is an ensemble of the selected meta-paths. To address the high computational cost of enumerating all possible meta-paths, we propose using an approximate personalized PageRank algorithm to find useful subgraphs in which to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can improve clustering and classification results. We use Freebase, a well-known world knowledge base, to perform semantic parsing and construct an HIN for documents. Our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim yields high-quality document clustering and classification performance. We also demonstrate that the approximate personalized PageRank algorithm can compute the meta-path-based similarity efficiently and effectively.
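To make the meta-path idea in the abstract concrete, the following is a minimal sketch of PathSim for a single symmetric meta-path, using the commonly stated definition s(x, y) = 2 M[x][y] / (M[x][x] + M[y][y]), where M counts path instances between entities. The toy adjacency matrix and the function names `commuting_matrix` and `pathsim` are hypothetical illustrations; this is not KnowSim and not the authors' implementation.

```python
def commuting_matrix(adj):
    """Count path instances for a symmetric two-hop meta-path.

    adj[i][k] is the number of links from entity i (e.g. a document)
    to intermediate entity k (e.g. a named entity). The result M is
    adj times its transpose, so M[i][j] counts paths i -> k -> j.
    """
    n = len(adj)
    return [
        [sum(a * b for a, b in zip(adj[i], adj[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, x, y):
    """PathSim between entities x and y given the path-count matrix m."""
    denom = m[x][x] + m[y][y]
    # Entities with no path instances at all get similarity 0 (assumption).
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical toy data: 3 documents (rows) linked to 3 entities (columns).
adj = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
]
m = commuting_matrix(adj)
sim_01 = pathsim(m, 0, 1)  # documents 0 and 1 share every entity
sim_02 = pathsim(m, 0, 2)  # documents 0 and 2 share one entity
```

In this toy case documents 0 and 1 link to identical entities, so their similarity is 1.0, while documents 0 and 2 share only one of their two entities and score 0.5. An ensemble over several meta-paths, as the abstract describes for KnowSim, would combine such per-path scores, but the weighting scheme is not given in this record.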
Pages: 1735-1767
Page count: 33