Joint embeddings with multimodal cues for video-text retrieval

Cited by: 0
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Affiliations
[1] University of California
[2] Carnegie Mellon University
Source
International Journal of Multimedia Information Retrieval | 2019 / Vol. 8
Keywords
Video-text retrieval; Joint embedding; Multimodal cues;
DOI
Not available
Abstract
For multimedia applications, constructing a joint representation that carries information from multiple modalities can be highly beneficial for downstream use cases. In this paper, we study how to effectively utilize the multimodal cues available in videos when learning joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are often very limited in size relative to the enormous diversity of the visual world, which makes it extremely difficult to build a robust video-text retrieval system on deep neural network models. To address this, we propose a framework that simultaneously exploits multimodal visual cues through a "mixture of experts" approach to retrieval. In addition, we propose a modified pairwise ranking loss function for training the embedding and study the effect of various loss functions. Extensive experiments on two benchmark datasets show that our approach yields significant gains over the state of the art.
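To make the loss-function discussion concrete, the sketch below shows a common form of the pairwise ranking objective used in joint-embedding retrieval: a max-margin hinge loss over a video-caption similarity matrix, with an optional hardest-negative variant (as popularized by VSE++-style training). This is an illustrative NumPy implementation under those assumptions, not the paper's exact formulation; the function name, `margin` default, and `hard_negative` flag are ours.

```python
import numpy as np

def pairwise_ranking_loss(sim, margin=0.2, hard_negative=True):
    """Max-margin pairwise ranking loss over a similarity matrix.

    sim[i, j] is the similarity between video i and caption j;
    matched (positive) pairs lie on the diagonal.
    """
    n = sim.shape[0]
    pos = np.diag(sim)  # similarities of the matched pairs

    # Hinge cost for retrieving a wrong caption for video i (rows)
    # and a wrong video for caption j (columns).
    cost_c = np.maximum(0.0, margin + sim - pos[:, None])
    cost_v = np.maximum(0.0, margin + sim - pos[None, :])

    # Zero out the diagonal so positives incur no cost against themselves.
    mask = ~np.eye(n, dtype=bool)
    cost_c = cost_c * mask
    cost_v = cost_v * mask

    if hard_negative:
        # Penalize only the hardest negative per positive pair.
        return cost_c.max(axis=1).sum() + cost_v.max(axis=0).sum()
    # Otherwise sum over all negatives (the classic formulation).
    return cost_c.sum() + cost_v.sum()
```

With well-separated embeddings (diagonal similarities exceed off-diagonal ones by at least the margin), the loss is zero; the hard-negative variant concentrates the gradient on the most confusing mismatched pair instead of averaging over all of them.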
Pages: 3-18
Page count: 15
Related papers
(50 items)
  • [31] CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval
    Gao, Yizhao
    Lu, Zhiwu
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 76 - 84
  • [32] CLIP Based Multi-Event Representation Generation for Video-Text Retrieval
    Tu R.
    Mao X.
    Kong W.
    Cai C.
    Zhao W.
    Wang H.
    Huang H.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2023, 60 (09): : 2169 - 2179
  • [33] Learning a Video-Text Joint Embedding using Korean Tagged Movie Clips
    Hahm, Gyeong-June
    Kwak, Chang-Uk
    Kim, Sun-Joong
    11TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE: DATA, NETWORK, AND AI IN THE AGE OF UNTACT (ICTC 2020), 2020, : 1158 - 1160
  • [34] Fine-Grained Cross-Modal Contrast Learning for Video-Text Retrieval
    Liu, Hui
    Lv, Gang
    Gu, Yanhong
    Nian, Fudong
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT V, ICIC 2024, 2024, 14866 : 298 - 310
  • [35] CONTEXT-AWARE HIERARCHICAL TRANSFORMER FOR FINE-GRAINED VIDEO-TEXT RETRIEVAL
    Chen, Mingliang
    Zhang, Weimin
    Ren, Yurui
    Li, Ge
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 386 - 390
  • [36] VideoCLIP: A Cross-Attention Model for Fast Video-Text Retrieval Task with Image CLIP
    Li, Yikang
    Hsiao, Jenhao
    Ho, Chiuman
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 29 - 33
  • [37] Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval
    Hao, Xiaoshuai
    Zhou, Yucan
    Wu, Dayan
    Zhang, Wanqian
    Li, Bo
    Wang, Weiping
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 135 - 143
  • [38] MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval
    Ge, Yuying
    Ge, Yixiao
    Liu, Xihui
    Wang, Jinpeng
    Wu, Jianping
    Shan, Ying
    Qie, Xiaohu
    Luo, Ping
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 691 - 708
  • [39] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
    Nian, Fudong
    Ding, Ling
    Hu, Yuxia
    Gu, Yanhong
    MATHEMATICS, 2022, 10 (18)
  • [40] INTEGRATED MODALITIES AND MULTI-LEVEL GRANULARITY: TOWARDS A UNIFIED VIDEO-TEXT RETRIEVAL FRAMEWORK
    Liu, Liu
    Wang, Wenzhe
    Zhang, Zhijie
    Zhang, Mengdan
    Peng, Pai
    Sun, Xing
    2021 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2021,