Joint embeddings with multimodal cues for video-text retrieval

Cited by: 0
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Affiliations
[1] University of California
[2] Carnegie Mellon University
Source
International Journal of Multimedia Information Retrieval | 2019 / Vol. 8
Keywords
Video-text retrieval; Joint embedding; Multimodal cues
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
For multimedia applications, constructing a joint representation that could carry information for multiple modalities could be very conducive for downstream use cases. In this paper, we study how to effectively utilize available multimodal cues from videos in learning joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are often very limited by their size considering the enormous amount of diversity the visual world contains. This makes it extremely difficult to develop a robust video-text retrieval system based on deep neural network models. In this regard, we propose a framework that simultaneously utilizes multimodal visual cues by a “mixture of experts” approach for retrieval. We conduct extensive experiments to verify that our system is able to boost the performance of the retrieval task compared to the state of the art. In addition, we propose a modified pairwise ranking loss function in training the embedding and study the effect of various loss functions. Experiments on two benchmark datasets show that our approach yields significant gain compared to the state of the art.
Pages: 3-18
Page count: 15