ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

Cited by: 0
Authors
Fragomeni, Adriano [1]
Wray, Michael [1]
Damen, Dima [1]
Affiliations
[1] Univ Bristol, Dept Comp Sci, Bristol, Avon, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
DOI
10.1007/978-3-031-26316-3_27
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve retrieval performance. We propose Context Transformer (ConTra), an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representation. Importantly, we supervise the context transformer using contrastive losses in the cross-modal embedding space. We explore context transformers for both video and text modalities. Results consistently demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive ablation studies and context analysis show the efficacy of the proposed method.
Pages: 451 - 468 (18 pages)
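The contrastive supervision described in the abstract pairs clip embeddings (enriched with surrounding-segment context) against sentence embeddings in a shared space. A minimal sketch of such a symmetric clip-sentence contrastive (InfoNCE-style) loss is below; the shapes, the temperature value, and the crude context-averaging stand-in for ConTra's attention are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def infonce_loss(clip_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired clip/sentence embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; all other
    batch entries act as negatives, in both retrieval directions.
    """
    v = l2_normalize(clip_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature            # (B, B) cosine similarities
    labels = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)            # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()                # diagonal = matches

    # video->text and text->video directions, averaged
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy usage: enrich each clip with its local temporal context before the loss.
rng = np.random.default_rng(0)
B, D = 4, 16
clip = rng.normal(size=(B, D))
context = rng.normal(size=(B, 2, D))           # two surrounding segments per clip
clip_with_ctx = clip + context.mean(axis=1)    # stand-in for ConTra's context attention
text = clip_with_ctx + 0.1 * rng.normal(size=(B, D))  # paired sentence embeddings
loss = infonce_loss(clip_with_ctx, text)
```

Because the paired sentences are constructed close to their clips, the loss for this batch is small; shuffling the text rows would raise it, which is exactly the signal that drives the embedding space apart for mismatched pairs.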
Related Papers (50 in total)
  • [1] Survey on Video-Text Cross-Modal Retrieval
    Chen, Lei
    Xi, Yimeng
    Liu, Libo
    Computer Engineering and Applications, 2024, 60 (04) : 1 - 20
  • [2] CRET: Cross-Modal Retrieval Transformer for Efficient Text-Video Retrieval
    Ji, Kaixiang
    Liu, Jiajia
    Hong, Weixiang
    Zhong, Liheng
    Wang, Jian
    Chen, Jingdong
    Chu, Wei
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 949 - 959
  • [3] Relation Triplet Construction for Cross-modal Text-to-Video Retrieval
    Song, Xue
    Chen, Jingjing
    Jiang, Yu-Gang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4759 - 4767
  • [4] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [5] Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query
    Wang, Gongmian
    Xu, Xing
    Shen, Fumin
    Lu, Huimin
    Ji, Yanli
    Shen, Heng Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1221 - 1232
  • [6] Cross-Modal Video Retrieval Model Based on Video-Text Dual Alignment
    Che, Zhanbin
    Guo, Huaili
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (02) : 303 - 311
  • [7] Multilevel Semantic Interaction Alignment for Video-Text Cross-Modal Retrieval
    Chen, Lei
    Deng, Zhen
    Liu, Libo
    Yin, Shibai
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 6559 - 6575
  • [8] Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval
    Jin, Weike
    Zhao, Zhou
    Zhang, Pengcheng
    Zhu, Jieming
    He, Xiuqiang
    Zhuang, Yueting
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1114 - 1124
  • [9] A cross-modal conditional mechanism based on attention for text-video retrieval
    Du, Wanru
    Jing, Xiaochuan
    Zhu, Quan
    Wang, Xiaoyin
    Liu, Xuan
    MATHEMATICAL BIOSCIENCES AND ENGINEERING, 2023, 20 (11) : 20073 - 20092
  • [10] CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval
    Gao, Yizhao
    Lu, Zhiwu
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 76 - 84