Two-Stream Convolutional Neural Network for Multimodal Matching

Cited by: 2
Authors
Zhang, Youcai [1]
Gu, Yiwei [1]
Gu, Xiaodong [1]
Affiliations
[1] Fudan Univ, Dept Elect Engn, Shanghai 200433, Peoples R China
Source
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2018, PT I | 2018 / Vol. 11139
Funding
National Natural Science Foundation of China
Keywords
Multimodal matching; Two-stream network; Convolutional neural network;
DOI
10.1007/978-3-030-01418-6_2
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Multimodal matching aims to establish relationships across different modalities such as image and text. Existing works mainly focus on maximizing the correlation between feature vectors extracted from off-the-shelf models, so feature extraction and matching form a two-stage learning process. This paper presents a novel two-stream convolutional neural network that integrates feature extraction and matching in an end-to-end manner. Visual and textual streams are designed for feature extraction and are then concatenated with multiple shared layers for multimodal matching. The network is trained using an extreme multiclass classification loss that treats each multimodal data item as its own class. A fine-tuning step is then performed with a ranking constraint. Experimental results on the Flickr30k dataset demonstrate the effectiveness of the proposed network for multimodal matching.
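The two training stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the loss formulas, so the cosine similarity, the bidirectional hinge form, the margin of 0.2, and every function name below are assumptions chosen for illustration only.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy for one sample, given raw class scores and the true class index.

    In the instance-as-class setting (an assumption about the paper's
    "extreme multiclass" loss), `target` would be the index of the
    image-text item the sample belongs to.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(img, txt, margin=0.2):
    """Bidirectional hinge ranking loss over matched pairs (img[i], txt[i]).

    Hypothetical form of the "ranking constraint": a matched pair should
    score at least `margin` higher than any mismatched pair in the batch.
    """
    n = len(img)
    total = 0.0
    for i in range(n):
        pos = cosine(img[i], txt[i])
        for j in range(n):
            if j == i:
                continue
            # image i against a wrong text j
            total += max(0.0, margin - pos + cosine(img[i], txt[j]))
            # text i against a wrong image j
            total += max(0.0, margin - pos + cosine(txt[i], img[j]))
    return total / n


# Toy features standing in for the outputs of the visual and textual streams.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
print(ranking_loss(image_feats, text_feats))
```

In this toy batch every matched pair is perfectly aligned and every mismatched pair is orthogonal, so no hinge term is active. In a real model the feature lists would be produced by the two streams and both losses would be minimized by gradient descent.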
Pages: 14-21
Number of pages: 8