Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval

Cited by: 0

Authors:

Fang, Han [1]

Yang, Zhifei [1]

Zang, Xianghao [1]

Ban, Chao [1]

He, Zhongjiang [1]

Sun, Hao [1]

Zhou, Lanxiang [1]

Affiliations:

[1] China Telecom Corp Ltd, Data&AI Technol Co, Hong Kong, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Keywords:

Video-Text Retrieval; Mask Video Modeling; Attention

DOI:

10.1145/3581783.3611756

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Recently, masked video modeling has been widely explored and has improved models' understanding of visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT), based on semantic-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover masked semantics information. The recovery mechanism is achieved by aligning the masked content with the unmasked visual regions and the corresponding textual context, which makes the model capture more text-related details at a patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts to ignore regions with low-informed masks. Furthermore, we design co-learning to incorporate video cues under different masks and learn better-aligned representations. Our MASCOT achieves state-of-the-art performance on four text-video retrieval benchmarks: MSR-VTT, LSMDC, ActivityNet, and DiDeMo.
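
The attention-based masking step mentioned in the abstract can be pictured with a small sketch. This is only one possible reading, not the authors' implementation: it assumes each video patch already has a scalar attention score, and it assumes a "high-informed" mask hides the highest-scoring patches while a "low-informed" mask hides the lowest-scoring ones. The function name `informed_masks` and the `mask_ratio` parameter are hypothetical.

```python
from typing import List, Tuple


def informed_masks(
    attn: List[float], mask_ratio: float = 0.5
) -> Tuple[List[bool], List[bool]]:
    """Return (high_informed_mask, low_informed_mask); True means "masked".

    Assumption (not from the paper): the high-informed mask hides the
    top-scoring patches and the low-informed mask hides the bottom-scoring ones.
    """
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be strictly between 0 and 1")
    n = len(attn)
    if n == 0:
        return [], []
    # Number of patches to hide in each mask (at least one).
    k = max(1, int(n * mask_ratio))
    # Patch indices sorted from highest to lowest attention score.
    order = sorted(range(n), key=lambda i: attn[i], reverse=True)
    high = [False] * n
    low = [False] * n
    for i in order[:k]:
        high[i] = True  # hide the most attended patches
    for i in order[-k:]:
        low[i] = True  # hide the least attended patches
    return high, low


if __name__ == "__main__":
    scores = [0.10, 0.40, 0.05, 0.45]  # illustrative per-patch scores
    high_mask, low_mask = informed_masks(scores, mask_ratio=0.5)
    print(high_mask, low_mask)
```

In a retrieval model, the scores would presumably come from an encoder's attention over patches; the two masks would then yield two differently masked views of the same video.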

Pages: 3847-3856

Page count: 10
