Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Cited by: 18
Authors
Wang, Chenguang [1]
Song, Yangqiu [2]
Li, Haoran [3]
Zhang, Ming [3]
Han, Jiawei [4]
Affiliations
[1] Amazon AI, 2100 Univ Ave, East Palo Alto, CA USA
[2] HKUST, Dept CSE, Clear Water Bay, Hong Kong, Peoples R China
[3] Peking Univ, Sch EECS, Beijing, Peoples R China
[4] UIUC, Dept CS, Urbana, IL 61801 USA
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
Heterogeneous information network; Similarity; Text categorization; Classification
DOI
10.1007/s10618-018-0581-y
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
A heterogeneous information network (HIN) is a general representation used in many different applications, such as social networks, scholarly networks, and knowledge networks. A key development for HINs is PathSim, a meta-path-based measure of the pairwise similarity of two entities of the same type in the HIN. When using PathSim in practice, we usually need to handcraft meta-paths, which are paths over entity types rather than over the entities themselves. However, finding useful meta-paths is not trivial for humans. In this paper, we present an unsupervised meta-path selection approach that automatically finds useful meta-paths over an HIN, and then develop a new similarity measure called KnowSim, which is an ensemble of the selected meta-paths. To avoid the high computational cost of enumerating all possible meta-paths, we propose using an approximate personalized PageRank algorithm to find useful subgraphs in which to locate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can improve clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct an HIN for documents. Our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim yields high-quality document clustering and classification performance. We also demonstrate that the approximate personalized PageRank algorithm can compute the meta-path-based similarity efficiently and effectively.
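As an illustration of the PathSim measure mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the standard PathSim definition for a symmetric meta-path, s(x, y) = 2 * |paths(x, y)| / (|paths(x, x)| + |paths(y, y)|), applied to a single hypothetical meta-path Document-Entity-Document. The toy data and the names `doc_entities`, `commuting_counts`, and `pathsim` are assumptions made for this example; KnowSim itself combines several selected meta-paths, which this sketch does not cover.

```python
from collections import defaultdict

# Hypothetical toy HIN: each document maps to the entities it mentions,
# with the number of mentions as the edge weight.
doc_entities = {
    "d1": {"Obama": 1, "Senate": 2},
    "d2": {"Obama": 1, "Senate": 1},
    "d3": {"NBA": 3},
}

def commuting_counts(doc_entities):
    """Count path instances along the meta-path Document-Entity-Document."""
    counts = defaultdict(int)
    for a, ents_a in doc_entities.items():
        for b, ents_b in doc_entities.items():
            shared = set(ents_a) & set(ents_b)
            # Each shared entity contributes (mentions in a) * (mentions in b) paths.
            counts[(a, b)] = sum(ents_a[e] * ents_b[e] for e in shared)
    return counts

def pathsim(counts, x, y):
    """PathSim for a symmetric meta-path, normalized by the self-path counts."""
    denom = counts[(x, x)] + counts[(y, y)]
    return 2 * counts[(x, y)] / denom if denom else 0.0

counts = commuting_counts(doc_entities)
print(pathsim(counts, "d1", "d2"))  # documents sharing entities score close to 1
print(pathsim(counts, "d1", "d3"))  # documents with no shared entity score 0
```

The normalization by self-path counts is what keeps a document with many entity mentions from dominating the score, which is the usual motivation given for PathSim over raw path counts.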
Pages: 1735-1767
Page count: 33