Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Cited by: 1

Authors:

Chenguang Wang

Yangqiu Song

Haoran Li

Ming Zhang

Jiawei Han

Affiliations:

[1] Amazon AI

[2] Department of CSE, HKUST

[3] School of EECS, Peking University

[4] Department of CS, UIUC

Source:

Data Mining and Knowledge Discovery | 2018 / Vol. 32

Keywords:

Heterogeneous information network; Similarity; Text categorization

DOI:

Not available

Abstract:

A heterogeneous information network (HIN) is a general representation used in many different applications, such as social networks, scholar networks, and knowledge networks. A key development for HINs is PathSim, a meta-path-based measure of the pairwise similarity of two entities of the same type in the HIN. When using PathSim in practice, we usually need to handcraft some meta-paths, which are paths over entity types rather than over the entities themselves. However, finding useful meta-paths is not trivial for humans. In this paper, we present an unsupervised meta-path selection approach that automatically finds useful meta-paths over a HIN, and then develop a new similarity measure called KnowSim, which is an ensemble of the selected meta-paths. To address the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct the HIN for documents. Our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim yields high-quality document clustering and classification performance. We also demonstrate that the approximate personalized PageRank algorithm can efficiently and effectively compute the meta-path-based similarity.
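As an illustration of the PathSim measure mentioned in the abstract, the following is a minimal sketch for a single symmetric meta-path of the form Document-Entity-Document. It uses the standard PathSim definition, s(x, y) = 2 * M[x][y] / (M[x][x] + M[y][y]), where M counts path instances along the meta-path. The function names and the toy document-entity matrix are hypothetical and chosen only for this example; this is not the KnowSim measure or the meta-path selection procedure from the paper.

```python
from typing import List


def commuting_matrix(incidence: List[List[int]]) -> List[List[int]]:
    """Count path instances for the symmetric meta-path Document-Entity-Document.

    incidence[i][k] is how often document i mentions entity k, so the result
    is M = W * W^T, where M[i][j] is the number of paths from doc i to doc j.
    """
    n = len(incidence)
    return [
        [sum(a * b for a, b in zip(incidence[i], incidence[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m: List[List[int]], x: int, y: int) -> float:
    """PathSim between objects x and y given the commuting matrix m."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical toy data: 3 documents (rows) and 3 entities (columns).
doc_entity = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
]

m = commuting_matrix(doc_entity)
sim_01 = pathsim(m, 0, 1)  # documents 0 and 1 share entities
sim_02 = pathsim(m, 0, 2)  # documents 0 and 2 share none
```

In this toy example, documents 0 and 1 share entities and so receive a positive score, while documents 0 and 2 share none and score zero. An ensemble measure in the spirit of KnowSim would combine such counts over several selected meta-paths, but the exact weighting is defined in the paper and is not reproduced here.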

Pages: 1735-1767

Page count: 32
