Joint embeddings with multimodal cues for video-text retrieval

Cited by: 0
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Affiliations
[1] University of California
[2] Carnegie Mellon University
Source
International Journal of Multimedia Information Retrieval | 2019 / Vol. 8
Keywords
Video-text retrieval; Joint embedding; Multimodal cues;
DOI
Not available
Abstract
For multimedia applications, constructing a joint representation that carries information from multiple modalities can be highly beneficial for downstream use cases. In this paper, we study how to effectively utilize the multimodal cues available in videos when learning joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are often very limited in size relative to the enormous diversity of the visual world, which makes it extremely difficult to build a robust video-text retrieval system on deep neural network models. To address this, we propose a framework that simultaneously exploits multimodal visual cues through a "mixture of experts" approach to retrieval. In addition, we propose a modified pairwise ranking loss function for training the embedding and study the effect of various loss functions. Extensive experiments on two benchmark datasets show that our approach yields significant gains over the state of the art.
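To make the loss-function discussion concrete, the sketch below shows a common form of the pairwise ranking objective used in joint-embedding retrieval: a max-margin hinge loss over a video-caption similarity matrix, with an optional hardest-negative variant (as popularized by VSE++-style training). This is an illustrative NumPy implementation under those assumptions, not the paper's exact formulation; the function name, `margin` default, and `hard_negative` flag are ours.

```python
import numpy as np

def pairwise_ranking_loss(sim, margin=0.2, hard_negative=True):
    """Max-margin pairwise ranking loss over a similarity matrix.

    sim[i, j] is the similarity between video i and caption j;
    matched (positive) pairs lie on the diagonal.
    """
    n = sim.shape[0]
    pos = np.diag(sim)  # similarities of the matched pairs

    # Hinge cost for retrieving a wrong caption for video i (rows)
    # and a wrong video for caption j (columns).
    cost_c = np.maximum(0.0, margin + sim - pos[:, None])
    cost_v = np.maximum(0.0, margin + sim - pos[None, :])

    # Zero out the diagonal so positives incur no cost against themselves.
    mask = ~np.eye(n, dtype=bool)
    cost_c = cost_c * mask
    cost_v = cost_v * mask

    if hard_negative:
        # Penalize only the hardest negative per positive pair.
        return cost_c.max(axis=1).sum() + cost_v.max(axis=0).sum()
    # Otherwise sum over all negatives (the classic formulation).
    return cost_c.sum() + cost_v.sum()
```

With well-separated embeddings (diagonal similarities exceed off-diagonal ones by at least the margin), the loss is zero; the hard-negative variant concentrates the gradient on the most confusing mismatched pair instead of averaging over all of them.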
Pages: 3-18
Page count: 15
Related papers
(50 items)
  • [31] CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval
    Gao, Yizhao
    Lu, Zhiwu
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 76 - 84
  • [32] CLIP Based Multi-Event Representation Generation for Video-Text Retrieval
    Tu R.
    Mao X.
    Kong W.
    Cai C.
    Zhao W.
    Wang H.
    Huang H.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2023, 60 (09): : 2169 - 2179
  • [33] Learning a Video-Text Joint Embedding using Korean Tagged Movie Clips
    Hahm, Gyeong-June
    Kwak, Chang-Uk
    Kim, Sun-Joong
    11TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE: DATA, NETWORK, AND AI IN THE AGE OF UNTACT (ICTC 2020), 2020, : 1158 - 1160
  • [34] Fine-Grained Cross-Modal Contrast Learning for Video-Text Retrieval
    Liu, Hui
    Lv, Gang
    Gu, Yanhong
    Nian, Fudong
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT V, ICIC 2024, 2024, 14866 : 298 - 310
  • [35] CONTEXT-AWARE HIERARCHICAL TRANSFORMER FOR FINE-GRAINED VIDEO-TEXT RETRIEVAL
    Chen, Mingliang
    Zhang, Weimin
    Ren, Yurui
    Li, Ge
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 386 - 390
  • [36] VideoCLIP: A Cross-Attention Model for Fast Video-Text Retrieval Task with Image CLIP
    Li, Yikang
    Hsiao, Jenhao
    Ho, Chiuman
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 29 - 33
  • [37] Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval
    Hao, Xiaoshuai
    Zhou, Yucan
    Wu, Dayan
    Zhang, Wanqian
    Li, Bo
    Wang, Weiping
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 135 - 143
  • [38] MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval
    Ge, Yuying
    Ge, Yixiao
    Liu, Xihui
    Wang, Jinpeng
    Wu, Jianping
    Shan, Ying
    Qie, Xiaohu
    Luo, Ping
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 691 - 708
  • [39] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
    Nian, Fudong
    Ding, Ling
    Hu, Yuxia
    Gu, Yanhong
    MATHEMATICS, 2022, 10 (18)
  • [40] INTEGRATED MODALITIES AND MULTI-LEVEL GRANULARITY: TOWARDS A UNIFIED VIDEO-TEXT RETRIEVAL FRAMEWORK
    Liu, Liu
    Wang, Wenzhe
    Zhang, Zhijie
    Zhang, Mengdan
    Peng, Pai
    Sun, Xing
    2021 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2021,