Joint embeddings with multimodal cues for video-text retrieval

Cited by: 0

Authors:

Niluthpol C. Mithun

Juncheng Li

Florian Metze

Amit K. Roy-Chowdhury

Affiliations:

[1] University of California,

[2] Carnegie Mellon University

Source:

International Journal of Multimedia Information Retrieval | 2019 / Vol. 8

Keywords:

Video-text retrieval; Joint embedding; Multimodal cues;

DOI:

Not available

Abstract:

For multimedia applications, constructing a joint representation that could carry information for multiple modalities could be very conducive for downstream use cases. In this paper, we study how to effectively utilize available multimodal cues from videos in learning joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are often very limited by their size considering the enormous amount of diversity the visual world contains. This makes it extremely difficult to develop a robust video-text retrieval system based on deep neural network models. In this regard, we propose a framework that simultaneously utilizes multimodal visual cues by a “mixture of experts” approach for retrieval. We conduct extensive experiments to verify that our system is able to boost the performance of the retrieval task compared to the state of the art. In addition, we propose a modified pairwise ranking loss function in training the embedding and study the effect of various loss functions. Experiments on two benchmark datasets show that our approach yields significant gain compared to the state of the art.

Pages: 3-18

Page count: 15
