Multimodal Network Embedding via Attention based Multi-view Variational Autoencoder

Cited by: 26
Authors
Huang, Feiran [1 ]
Zhang, Xiaoming [1 ]
Li, Chaozhuo [1 ]
Li, Zhoujun [1 ]
He, Yueying [2 ]
Zhao, Zhonghua [2 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China
[2] Natl Comp Network Emergency Response Tech Team, Coordinat Ctr China, Beijing, Peoples R China
Source
ICMR '18: PROCEEDINGS OF THE 2018 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL | 2018
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program)
Keywords
multimodal; multi-view; network embedding; VAE; attention;
DOI
10.1145/3206025.3206035
CLC number
TP [Automation Technology; Computer Technology]
Discipline code
0812
Abstract
Learning embeddings for social media data has attracted extensive research interest and enabled many applications, such as classification and link prediction. We examine the scenario of a multimodal network whose nodes contain multimodal content and are connected by heterogeneous relationships; for example, social images contain multimodal content (e.g., visual content and a text description) and are linked in various ways (e.g., appearing in the same album or sharing the same tag). Given such a multimodal network, simply learning the embedding from the network structure or from a subset of the content yields a sub-optimal representation. In this paper, we propose a novel deep embedding method, the Attention-based Multi-view Variational Auto-Encoder (AMVAE), which incorporates both link information and multimodal content for more effective and efficient embedding. Specifically, we adopt an LSTM with an attention model to learn the correlations between data modalities, such as those between visual regions and specific words, and thereby obtain a semantic embedding of the multimodal content. The link information and the semantic embedding are then treated as two correlated views, and a Variational Auto-Encoder (VAE) based on multi-view correlation learning is proposed to learn the representation of each node, in which the embeddings of the link information and the multimodal content are integrated and mutually reinforced. Experiments on three real-world datasets demonstrate the superiority of the proposed model on two applications: multi-label classification and link prediction.
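The abstract describes coupling two views (the link embedding and the semantic embedding of the content) through a shared Gaussian latent code in a VAE. As a minimal sketch only (not the authors' AMVAE implementation; the linear encoders, the averaging fusion of the two views' latent parameters, and all variable names are illustrative assumptions), the core VAE machinery — encoding each view to Gaussian parameters, sampling with the reparameterization trick, and the KL regularizer — looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(view, W_mu, W_logvar):
    """Map a view vector to Gaussian latent parameters (linear encoder for brevity)."""
    return view @ W_mu, view @ W_logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, keeping the sample differentiable w.r.t. mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), the VAE regularization term."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

d_view, d_latent = 8, 4
link_view = rng.standard_normal(d_view)     # hypothetical embedding of the network structure
content_view = rng.standard_normal(d_view)  # hypothetical semantic embedding of the content

W_mu = rng.standard_normal((d_view, d_latent))
W_logvar = rng.standard_normal((d_view, d_latent)) * 0.1

# Encode each view, then fuse the latent parameters; a simple average stands in
# for the paper's multi-view correlation learning.
mu1, lv1 = encode(link_view, W_mu, W_logvar)
mu2, lv2 = encode(content_view, W_mu, W_logvar)
mu, logvar = (mu1 + mu2) / 2, (lv1 + lv2) / 2

z = reparameterize(mu, logvar, rng)  # shared node representation sampled from the fused posterior
print(z.shape)
```

The reparameterization trick is what makes the sampling step trainable by backpropagation in the full model; the KL term above is always non-negative and pulls the fused posterior toward a standard normal prior.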
Pages: 108-116
Page count: 9