Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval

Cited: 0
Authors
Fang, Han [1 ]
Yang, Zhifei [1 ]
Zang, Xianghao [1 ]
Ban, Chao [1 ]
He, Zhongjiang [1 ]
Sun, Hao [1 ]
Zhou, Lanxiang [1 ]
Affiliations
[1] China Telecom Corp Ltd, Data&AI Technol Co, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023
Keywords
Video-Text Retrieval; Mask Video Modeling; Attention
DOI
10.1145/3581783.3611756
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, masked video modeling has been widely explored and has improved models' local-level understanding of visual regions. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT), built on semantics-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover masked semantic information. The recovery mechanism aligns the masked content with the unmasked visual regions and the corresponding textual context, which lets the model capture more text-related details at the patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts, ignoring regions under low-informed masks. Furthermore, we design a co-learning scheme that incorporates video cues under different masks to learn more aligned representations. MASCOT achieves state-of-the-art performance on four text-video retrieval benchmarks: MSR-VTT, LSMDC, ActivityNet, and DiDeMo.
Pages: 3847-3856
Page count: 10