FROM INTRA-MODAL TO INTER-MODAL SPACE: MULTI-TASK LEARNING OF SHARED REPRESENTATIONS FOR CROSS-MODAL RETRIEVAL

Cited by: 0
Authors
Choi, Jaeyoung [1 ,3 ]
Larson, Martha [1 ,2 ]
Friedland, Gerald [4 ]
Hanjalic, Alan [1 ]
Affiliations
[1] Delft Univ Technol, Delft, Netherlands
[2] Radboud Univ Nijmegen, Nijmegen, Netherlands
[3] Int Comp Sci Inst, Berkeley, CA USA
[4] Univ Calif Berkeley, Berkeley, CA 94720 USA
Source
2019 IEEE FIFTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM 2019) | 2019
Keywords
cross-modal retrieval; multi-task learning; video retrieval; image retrieval
DOI
10.1109/BigMM.2019.00014
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Learning a robust shared representation space is critical for effective multimedia retrieval, and is increasingly important as multimodal data grows in volume and diversity. The labeled datasets necessary for learning such a space are limited in size and also in coverage of semantic concepts. These limitations constrain performance: a shared representation learned on one dataset may not generalize well to another. We address this issue by building on the insight that, given limited data, it is easier to optimize the semantic structure of a space within a modality than across modalities. We propose a two-stage shared representation learning framework with intra-modal optimization and subsequent cross-modal transfer learning of semantic structure that produces a robust shared representation space. We integrate multi-task learning into each step, making it possible to leverage multiple datasets, annotated with different concepts, as if they were one large dataset. Large-scale systematic experiments demonstrate improvements over previously reported state-of-the-art methods on cross-modal retrieval tasks.
Pages: 1-10
Page count: 10