ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

Cited by: 0
Authors
Fragomeni, Adriano [1]
Wray, Michael [1]
Damen, Dima [1]
Affiliations
[1] Univ Bristol, Dept Comp Sci, Bristol, Avon, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
DOI
10.1007/978-3-031-26316-3_27
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve retrieval performance. We propose Context Transformer (ConTra), an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representation. Importantly, we supervise the context transformer using contrastive losses in the cross-modal embedding space. We explore context transformers for both video and text modalities. Results consistently demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive ablation studies and context analysis show the efficacy of the proposed method.
Pages: 451 - 468 (18 pages)
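The contrastive supervision described in the abstract pairs clip embeddings (enriched with surrounding-segment context) against sentence embeddings in a shared space. A minimal sketch of such a symmetric clip-sentence contrastive (InfoNCE-style) loss is below; the shapes, the temperature value, and the crude context-averaging stand-in for ConTra's attention are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def infonce_loss(clip_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired clip/sentence embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; all other
    batch entries act as negatives, in both retrieval directions.
    """
    v = l2_normalize(clip_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature            # (B, B) cosine similarities
    labels = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)            # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()                # diagonal = matches

    # video->text and text->video directions, averaged
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy usage: enrich each clip with its local temporal context before the loss.
rng = np.random.default_rng(0)
B, D = 4, 16
clip = rng.normal(size=(B, D))
context = rng.normal(size=(B, 2, D))           # two surrounding segments per clip
clip_with_ctx = clip + context.mean(axis=1)    # stand-in for ConTra's context attention
text = clip_with_ctx + 0.1 * rng.normal(size=(B, D))  # paired sentence embeddings
loss = infonce_loss(clip_with_ctx, text)
```

Because the paired sentences are constructed close to their clips, the loss for this batch is small; shuffling the text rows would raise it, which is exactly the signal that drives the embedding space apart for mismatched pairs.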
Related Papers (50 in total)
  • [1] Survey on Video-Text Cross-Modal Retrieval
    Chen, Lei
    Xi, Yimeng
    Liu, Libo
    Computer Engineering and Applications, 2024, 60 (04) : 1 - 20
  • [2] CRET: Cross-Modal Retrieval Transformer for Efficient Text-Video Retrieval
    Ji, Kaixiang
    Liu, Jiajia
    Hong, Weixiang
    Zhong, Liheng
    Wang, Jian
    Chen, Jingdong
    Chu, Wei
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 949 - 959
  • [3] Relation Triplet Construction for Cross-modal Text-to-Video Retrieval
    Song, Xue
    Chen, Jingjing
    Jiang, Yu-Gang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4759 - 4767
  • [4] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [5] Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query
    Wang, Gongmian
    Xu, Xing
    Shen, Fumin
    Lu, Huimin
    Ji, Yanli
    Shen, Heng Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1221 - 1232
  • [6] Cross-Modal Video Retrieval Model Based on Video-Text Dual Alignment
    Che, Zhanbin
    Guo, Huaili
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (02) : 303 - 311
  • [7] Multilevel Semantic Interaction Alignment for Video-Text Cross-Modal Retrieval
    Chen, Lei
    Deng, Zhen
    Liu, Libo
    Yin, Shibai
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 6559 - 6575
  • [8] Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval
    Jin, Weike
    Zhao, Zhou
    Zhang, Pengcheng
    Zhu, Jieming
    He, Xiuqiang
    Zhuang, Yueting
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1114 - 1124
  • [9] A cross-modal conditional mechanism based on attention for text-video retrieval
    Du, Wanru
    Jing, Xiaochuan
    Zhu, Quan
    Wang, Xiaoyin
    Liu, Xuan
    MATHEMATICAL BIOSCIENCES AND ENGINEERING, 2023, 20 (11) : 20073 - 20092
  • [10] CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval
    Gao, Yizhao
    Lu, Zhiwu
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 76 - 84