Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Cited by: 18

Authors:

Wang, Chenguang [1]

Song, Yangqiu [2]

Li, Haoran [3]

Zhang, Ming [3]

Han, Jiawei [4]

Affiliations:

[1] Amazon AI, 2100 Univ Ave, East Palo Alto, CA USA

[2] HKUST, Dept CSE, Clear Water Bay, Hong Kong, Peoples R China

[3] Peking Univ, Sch EECS, Beijing, Peoples R China

[4] UIUC, Dept CS, Urbana, IL 61801 USA

Source:

DATA MINING AND KNOWLEDGE DISCOVERY | 2018 / Volume 32 / Issue 06

Funding:

US National Science Foundation; National Natural Science Foundation of China;

Keywords:

Heterogeneous information network; Similarity; Text categorization; CLASSIFICATION;

DOI:

10.1007/s10618-018-0581-y

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Heterogeneous information network (HIN) is a general representation of many different applications, such as social networks, scholar networks, and knowledge networks. A key development of HIN is called PathSim based on meta-path, which measures the pairwise similarity of two entities in the HIN of the same type. When using PathSim in practice, we usually need to handcraft some meta-paths which are paths over entity types instead of entities themselves. However, finding useful meta-paths is not trivial to human. In this paper, we present an unsupervised meta-path selection approach to automatically find useful meta-paths over HIN, and then develop a new similarity measure called KnowSim which is an ensemble of selected meta-paths. To solve the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve the clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HIN for documents. Our experiments on 20Newsgroups and RCV1 datasets show that KnowSim results in impressive high-quality document clustering and classification performance. We also demonstrate the approximate personalized PageRank algorithm can efficiently and effectively compute the meta-path based similarity.
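The abstract describes PathSim as a meta-path based similarity between two entities of the same type, and KnowSim as an ensemble over selected meta-paths. The sketch below illustrates the idea under stated assumptions: PathSim uses the standard form 2·M[i][j] / (M[i][i] + M[j][j]), where M counts path instances along a symmetric meta-path, while the weighted ensemble is only one plausible reading of "an ensemble of selected meta-paths". The function names, toy matrices, and weights are hypothetical and do not come from the paper.

```python
def commuting_matrix(adj):
    """Count path instances for a symmetric meta-path A-B-A.

    adj[i][k] is the number of links between entity i (type A)
    and entity k (type B); the result is adj times its transpose.
    """
    n = len(adj)
    return [
        [sum(a * b for a, b in zip(adj[i], adj[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, i, j):
    """PathSim for one meta-path, given its commuting matrix m."""
    denom = m[i][i] + m[j][j]
    return 2.0 * m[i][j] / denom if denom else 0.0


def weighted_pathsim(matrices, weights, i, j):
    """Assumed ensemble form: weight the path counts of each meta-path
    before normalising. The weights here are illustrative placeholders."""
    num = sum(w * m[i][j] for m, w in zip(matrices, weights))
    denom = sum(w * (m[i][i] + m[j][j]) for m, w in zip(matrices, weights))
    return 2.0 * num / denom if denom else 0.0


# Toy HIN with three documents, three words, and two named entities.
doc_word = [[1, 1, 0], [1, 1, 0], [0, 0, 2]]
doc_entity = [[1, 0], [0, 1], [0, 1]]

m_word = commuting_matrix(doc_word)      # Document-Word-Document
m_entity = commuting_matrix(doc_entity)  # Document-Entity-Document

same_words = pathsim(m_word, 0, 1)
no_overlap = pathsim(m_word, 0, 2)
combined = weighted_pathsim([m_word, m_entity], [0.5, 0.5], 0, 1)
```

In this toy network, documents 0 and 1 share all their words but no entities, so the single-path score and the combined score differ. That difference is why the choice and weighting of meta-paths matters.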

Citation

Pages: 1735-1767

Page count: 33

