Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Cited by: 18

Authors:

Wang, Chenguang [1]

Song, Yangqiu [2]

Li, Haoran [3]

Zhang, Ming [3]

Han, Jiawei [4]

Affiliations:

[1] Amazon AI, 2100 Univ Ave, East Palo Alto, CA USA

[2] HKUST, Dept CSE, Clear Water Bay, Hong Kong, Peoples R China

[3] Peking Univ, Sch EECS, Beijing, Peoples R China

[4] UIUC, Dept CS, Urbana, IL 61801 USA

Source:

DATA MINING AND KNOWLEDGE DISCOVERY | 2018 / Volume 32 / Issue 06

Funding:

US National Science Foundation; National Natural Science Foundation of China;

Keywords:

Heterogeneous information network; Similarity; Text categorization; CLASSIFICATION;

DOI:

10.1007/s10618-018-0581-y

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

A heterogeneous information network (HIN) is a general representation used in many different applications, such as social networks, scholarly networks, and knowledge networks. A key development for HINs is PathSim, a meta-path-based measure of the pairwise similarity of two entities of the same type in an HIN. Using PathSim in practice usually requires handcrafting meta-paths, which are paths over entity types rather than over the entities themselves. However, finding useful meta-paths is not trivial for humans. In this paper, we present an unsupervised meta-path selection approach that automatically finds useful meta-paths over an HIN, and we then develop a new similarity measure called KnowSim, which is an ensemble of the selected meta-paths. To address the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can improve clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct an HIN for documents. Our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim yields high-quality document clustering and classification performance. We also demonstrate that the approximate personalized PageRank algorithm can compute the meta-path-based similarity efficiently and effectively.
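As a rough illustration of the PathSim measure mentioned in the abstract, the sketch below computes the similarity of two documents under a single symmetric meta-path (document, entity, document). It uses the standard PathSim definition, s(x, y) = 2 M[x][y] / (M[x][x] + M[y][y]), where M counts path instances between objects. The toy document-entity counts and the function names are assumptions made for this example; they are not taken from the paper, and the sketch does not cover KnowSim's ensemble of selected meta-paths or the approximate personalized PageRank step.

```python
def commuting_matrix(adjacency):
    """Count path instances for the symmetric meta-path A-B-A (M = W * W^T)."""
    n = len(adjacency)
    return [
        [sum(a * b for a, b in zip(adjacency[i], adjacency[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, x, y):
    """PathSim between objects x and y given a commuting matrix m."""
    denom = m[x][x] + m[y][y]
    # Objects with no path instances at all are treated as dissimilar.
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical toy data: rows are documents, columns are entities,
# and each value is how often the entity is mentioned in the document.
doc_entity = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
]

M = commuting_matrix(doc_entity)
print(pathsim(M, 0, 1))  # documents 0 and 1 share entities
print(pathsim(M, 0, 2))  # documents 0 and 2 share none
```

In this toy setting, documents that mention the same entities with similar frequency score close to 1, and documents with no shared entities score 0.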

Citation:

Pages: 1735-1767

Page count: 33

