Joint embeddings with multimodal cues for video-text retrieval

Cited by: 0
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Affiliations
[1] University of California
[2] Carnegie Mellon University
Source
International Journal of Multimedia Information Retrieval | 2019, Vol. 8
Keywords
Video-text retrieval; Joint embedding; Multimodal cues;
DOI: Not available
Abstract
For multimedia applications, constructing a joint representation that carries information from multiple modalities can greatly benefit downstream tasks. In this paper, we study how to effectively exploit the multimodal cues available in videos when learning joint representations for cross-modal video-text retrieval. Existing hand-labeled video-text datasets are very limited in size relative to the enormous diversity of the visual world, which makes it extremely difficult to build a robust video-text retrieval system on deep neural network models. In this regard, we propose a framework that simultaneously utilizes multimodal visual cues through a "mixture of experts" approach for retrieval. We conduct extensive experiments to verify that our system boosts retrieval performance compared to the state of the art. In addition, we propose a modified pairwise ranking loss function for training the embedding and study the effect of various loss functions. Experiments on two benchmark datasets show that our approach yields significant gains over the state of the art.
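The abstract mentions a modified pairwise ranking loss and a mixture-of-experts combination of cues but does not spell either out. The sketch below is a minimal, hypothetical PyTorch illustration of the general technique: a bidirectional max-margin ranking loss that emphasizes the hardest negative in the batch, plus a simple weighted fusion of per-expert similarity scores. The function names, margin value, and fusion weights are assumptions for illustration, not the authors' exact formulation.

```python
# Minimal sketch, assuming a PyTorch setup. Illustrative only; names and
# constants are placeholders, not the paper's exact method.
import torch


def hard_negative_ranking_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               margin: float = 0.2) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) L2-normalized embeddings in the joint
    space; row i of each tensor is a matching video-text pair."""
    sim = video_emb @ text_emb.t()            # cosine similarities, (batch, batch)
    pos = sim.diag().view(-1, 1)              # similarity of each positive pair

    # Margin violations for both retrieval directions.
    cost_v2t = (margin + sim - pos).clamp(min=0)      # video query vs. wrong text
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)  # text query vs. wrong video

    # Positives are not their own negatives: zero out the diagonal.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_v2t = cost_v2t.masked_fill(eye, 0)
    cost_t2v = cost_t2v.masked_fill(eye, 0)

    # "Modified" (hard-negative) variant: keep only the worst violator per
    # query instead of summing over all negatives.
    return cost_v2t.max(dim=1).values.mean() + cost_t2v.max(dim=0).values.mean()


def fuse_expert_scores(expert_sims: list, weights: list) -> torch.Tensor:
    """Weighted sum of similarity matrices produced by different experts
    (e.g., an object-text expert and an activity/audio-text expert)."""
    return sum(w * s for w, s in zip(weights, expert_sims))
```

At test time, a convex combination such as fuse_expert_scores([sim_object, sim_activity], [0.5, 0.5]) would rank captions against each video query; the weights here are hypothetical hyperparameters to be tuned on validation data.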
Pages: 3 - 18
Page count: 15
Related Papers (50 in total)
  • [21] Wang, Zhiwen; Zhang, Donglin; Hu, Zhikai. LSECA: local semantic enhancement and cross aggregation for video-text retrieval. INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (03)
  • [22] Zhu, Jingxuan; Shen, Xiangjun; Mehta, Sumet; Abeo, Timothy Apasiba; Zhan, Yongzhao. Self-expressive induced clustered attention for video-text retrieval. MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [23] Jin, Lu; Li, Zechao; Tang, Jinhui. Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (04): 1838 - 1851
  • [24] Fang, Han; Xiong, Pengfei; Xu, Luhui; Luo, Wenhan. Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 7772 - 7785
  • [25] Chen, Lei; Deng, Zhen; Liu, Libo; Yin, Shibai. Multilevel Semantic Interaction Alignment for Video-Text Cross-Modal Retrieval. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07): 6559 - 6575
  • [26] Fang, Sheng; Wang, Shuhui; Zhuo, Junbao; Huang, Qingming; Ma, Bin; Wei, Xiaoming; Wei, Xiaolin. Concept Propagation via Attentional Knowledge Graph Reasoning for Video-Text Retrieval. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 4789 - 4800
  • [27] Liu, Baolong; Zheng, Qi; Wang, Yabing; Zhang, Minsong; Dong, Jianfeng; Wang, Xun. FeatInter: Exploring fine-grained object features for video-text retrieval. NEUROCOMPUTING, 2022, 496: 178 - 191
  • [28] Shu, Fangxun; Chen, Biaolong; Liao, Yue; Wang, Jinqiao; Liu, Si. MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 9962 - 9972
  • [29] Wang, Wei; Gao, Junyu; Yang, Xiaoshan; Xu, Changsheng. Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23: 2386 - 2397
  • [30] Jin, Weike; Zhao, Zhou; Zhang, Pengcheng; Zhu, Jieming; He, Xiuqiang; Zhuang, Yueting. Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021: 1114 - 1124