MDL-CW: A Multimodal Deep Learning Framework with Cross Weights

Cited by: 39
Authors
Rastegar, Sarah [1 ]
Baghshah, Mahdieh Soleymani [1]
Rabiee, Hamid R. [1 ]
Shojaee, Seyed Mohsen [1 ]
Affiliations
[1] Sharif Univ Technol, Dept Comp Engn, AICT Innovat Ctr, Tehran, Iran
Source
2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2016
DOI
10.1109/CVPR.2016.285
CLC Number: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
Deep learning has received much attention in recent years as one of the most powerful approaches for multimodal representation learning. An ideal model for multimodal data can reason about missing modalities using the available ones, and usually provides more information when multiple modalities are considered. Previous deep models contain separate modality-specific networks and find a shared representation on top of those networks; therefore, they consider only high-level interactions between modalities when finding a joint representation. In this paper, we propose a multimodal deep learning framework (MDL-CW) that exploits cross weights between the representations of modalities and gradually learns interactions between the modalities in a deep-network manner (from low-level to high-level interactions). Moreover, we theoretically show that considering these interactions provides more intra-modality information, and we introduce a multi-stage pre-training method based on the properties of multimodal data. In the proposed framework, as opposed to existing deep methods for multimodal data, we reconstruct the representation of each modality at a given layer from the representations of the other modalities in the previous layer. Extensive experimental results show that the proposed model outperforms state-of-the-art information-retrieval methods for both image and text queries on the PASCAL-Sentence and SUN-Attribute databases.
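The cross-weight idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the class name, dimensions, initialization, and activation are all illustrative assumptions. It only shows the core structural point that each modality's next-layer representation mixes its own previous representation (self weights) with the other modality's previous representation (cross weights).

```python
import numpy as np


def relu(x):
    """Elementwise ReLU activation (an assumed choice for this sketch)."""
    return np.maximum(0.0, x)


class CrossWeightLayer:
    """Hypothetical single layer with cross weights between two modalities.

    Modality a's next representation depends on both h_a (via W_aa) and
    h_b (via the cross weights W_ba), and symmetrically for modality b.
    """

    def __init__(self, dim_a, dim_b, out_a, out_b, seed=0):
        rng = np.random.default_rng(seed)
        self.W_aa = rng.normal(0.0, 0.1, (dim_a, out_a))  # self weights, a -> a
        self.W_ba = rng.normal(0.0, 0.1, (dim_b, out_a))  # cross weights, b -> a
        self.W_bb = rng.normal(0.0, 0.1, (dim_b, out_b))  # self weights, b -> b
        self.W_ab = rng.normal(0.0, 0.1, (dim_a, out_b))  # cross weights, a -> b

    def forward(self, h_a, h_b):
        """Compute both modalities' next-layer representations."""
        next_a = relu(h_a @ self.W_aa + h_b @ self.W_ba)
        next_b = relu(h_b @ self.W_bb + h_a @ self.W_ab)
        return next_a, next_b


# Usage: two batches of 2 samples with different per-modality dimensions.
layer = CrossWeightLayer(dim_a=4, dim_b=6, out_a=5, out_b=5)
h_a = np.ones((2, 4))
h_b = np.ones((2, 6))
next_a, next_b = layer.forward(h_a, h_b)
```

Stacking such layers is one way to realize the low-to-high-level interaction learning the abstract describes, since each layer lets information flow across modalities before the shared top representation.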
Pages: 2601-2609
Page count: 9