Joint embeddings with multimodal cues for video-text retrieval

Cited by: 0
Authors
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
Affiliations
[1] University of California
[2] Carnegie Mellon University
Source
International Journal of Multimedia Information Retrieval | 2019, Vol. 8
Keywords
Video-text retrieval; Joint embedding; Multimodal cues;
DOI: Not available
Abstract
For multimedia applications, constructing a joint representation that carries information from multiple modalities can greatly benefit downstream tasks. In this paper, we study how to effectively exploit the multimodal cues available in videos when learning joint representations for cross-modal video-text retrieval. Existing hand-labeled video-text datasets are very limited in size relative to the enormous diversity of the visual world, which makes it extremely difficult to build a robust video-text retrieval system on deep neural network models. In this regard, we propose a framework that simultaneously utilizes multimodal visual cues through a "mixture of experts" approach for retrieval. We conduct extensive experiments to verify that our system boosts retrieval performance compared to the state of the art. In addition, we propose a modified pairwise ranking loss function for training the embedding and study the effect of various loss functions. Experiments on two benchmark datasets show that our approach yields significant gains over the state of the art.
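The abstract mentions a modified pairwise ranking loss and a mixture-of-experts combination of cues but does not spell either out. The sketch below is a minimal, hypothetical PyTorch illustration of the general technique: a bidirectional max-margin ranking loss that emphasizes the hardest negative in the batch, plus a simple weighted fusion of per-expert similarity scores. The function names, margin value, and fusion weights are assumptions for illustration, not the authors' exact formulation.

```python
# Minimal sketch, assuming a PyTorch setup. Illustrative only; names and
# constants are placeholders, not the paper's exact method.
import torch


def hard_negative_ranking_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               margin: float = 0.2) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) L2-normalized embeddings in the joint
    space; row i of each tensor is a matching video-text pair."""
    sim = video_emb @ text_emb.t()            # cosine similarities, (batch, batch)
    pos = sim.diag().view(-1, 1)              # similarity of each positive pair

    # Margin violations for both retrieval directions.
    cost_v2t = (margin + sim - pos).clamp(min=0)      # video query vs. wrong text
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)  # text query vs. wrong video

    # Positives are not their own negatives: zero out the diagonal.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_v2t = cost_v2t.masked_fill(eye, 0)
    cost_t2v = cost_t2v.masked_fill(eye, 0)

    # "Modified" (hard-negative) variant: keep only the worst violator per
    # query instead of summing over all negatives.
    return cost_v2t.max(dim=1).values.mean() + cost_t2v.max(dim=0).values.mean()


def fuse_expert_scores(expert_sims: list, weights: list) -> torch.Tensor:
    """Weighted sum of similarity matrices produced by different experts
    (e.g., an object-text expert and an activity/audio-text expert)."""
    return sum(w * s for w, s in zip(weights, expert_sims))
```

At test time, a convex combination such as fuse_expert_scores([sim_object, sim_activity], [0.5, 0.5]) would rank captions against each video query; the weights here are hypothetical hyperparameters to be tuned on validation data.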
Pages: 3 - 18
Page count: 15
Related Papers (50 in total)
  • [21] Wang, Zhiwen; Zhang, Donglin; Hu, Zhikai. LSECA: local semantic enhancement and cross aggregation for video-text retrieval. INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (03)
  • [22] Zhu, Jingxuan; Shen, Xiangjun; Mehta, Sumet; Abeo, Timothy Apasiba; Zhan, Yongzhao. Self-expressive induced clustered attention for video-text retrieval. MULTIMEDIA SYSTEMS, 2024, 30 (06)
  • [23] Jin, Lu; Li, Zechao; Tang, Jinhui. Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (04): 1838 - 1851
  • [24] Fang, Han; Xiong, Pengfei; Xu, Luhui; Luo, Wenhan. Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 7772 - 7785
  • [25] Chen, Lei; Deng, Zhen; Liu, Libo; Yin, Shibai. Multilevel Semantic Interaction Alignment for Video-Text Cross-Modal Retrieval. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07): 6559 - 6575
  • [26] Fang, Sheng; Wang, Shuhui; Zhuo, Junbao; Huang, Qingming; Ma, Bin; Wei, Xiaoming; Wei, Xiaolin. Concept Propagation via Attentional Knowledge Graph Reasoning for Video-Text Retrieval. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 4789 - 4800
  • [27] Liu, Baolong; Zheng, Qi; Wang, Yabing; Zhang, Minsong; Dong, Jianfeng; Wang, Xun. FeatInter: Exploring fine-grained object features for video-text retrieval. NEUROCOMPUTING, 2022, 496: 178 - 191
  • [28] Shu, Fangxun; Chen, Biaolong; Liao, Yue; Wang, Jinqiao; Liu, Si. MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 9962 - 9972
  • [29] Wang, Wei; Gao, Junyu; Yang, Xiaoshan; Xu, Changsheng. Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23: 2386 - 2397
  • [30] Jin, Weike; Zhao, Zhou; Zhang, Pengcheng; Zhu, Jieming; He, Xiuqiang; Zhuang, Yueting. Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021: 1114 - 1124