Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Cited by: 1
Authors
Chenguang Wang
Yangqiu Song
Haoran Li
Ming Zhang
Jiawei Han
Affiliations
[1] Amazon AI
[2] Department of CSE, HKUST
[3] School of EECS, Peking University
[4] Department of CS, UIUC
Source
Data Mining and Knowledge Discovery | 2018 / Vol. 32
Keywords
Heterogeneous information network; Similarity; Text categorization
DOI
Not available
Abstract
A heterogeneous information network (HIN) is a general representation for many applications, such as social networks, scholarly networks, and knowledge networks. A key development for HINs is PathSim, a meta-path-based measure of the pairwise similarity between two entities of the same type in an HIN. Using PathSim in practice usually requires handcrafting meta-paths, which are paths over entity types rather than over the entities themselves. However, finding useful meta-paths is not trivial for humans. In this paper, we present an unsupervised meta-path selection approach that automatically finds useful meta-paths over an HIN, and we then develop a new similarity measure, KnowSim, which is an ensemble of the selected meta-paths. To address the high computational cost of enumerating all possible meta-paths, we propose using an approximate personalized PageRank algorithm to find useful subgraphs in which to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can improve clustering and classification results. We use Freebase, a well-known world knowledge base, to perform semantic parsing and construct an HIN for documents. Our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim yields high-quality document clustering and classification performance. We also demonstrate that the approximate personalized PageRank algorithm can compute the meta-path-based similarity efficiently and effectively.
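As a rough illustration of the PathSim measure the abstract builds on, the sketch below computes PathSim for a single symmetric meta-path (Document-Entity-Document) on toy data. It is not the paper's KnowSim, which ensembles several selected meta-paths; the function names, the matrix `doc_entity`, and its counts are hypothetical and chosen only for illustration.

```python
def commuting_matrix(adj):
    """Return M = A * A^T for a document-by-entity count matrix A.

    M[i][j] counts the path instances between documents i and j
    along the symmetric meta-path Document-Entity-Document.
    """
    n = len(adj)
    return [
        [sum(a * b for a, b in zip(adj[i], adj[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, x, y):
    """PathSim between x and y: 2 * M[x][y] / (M[x][x] + M[y][y])."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical toy data: 3 documents (rows) by 3 entities (columns).
doc_entity = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
]

m = commuting_matrix(doc_entity)
sim_01 = pathsim(m, 0, 1)  # documents 0 and 1 share entities
sim_02 = pathsim(m, 0, 2)  # documents 0 and 2 share none
```

The denominator normalizes by each document's self-path count, so a document is maximally similar to itself and documents with no shared entities score zero.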
Pages: 1735-1767
Page count: 32