Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval

Cited: 0
Authors
Fang, Han [1 ]
Yang, Zhifei [1 ]
Zang, Xianghao [1 ]
Ban, Chao [1 ]
He, Zhongjiang [1 ]
Sun, Hao [1 ]
Zhou, Lanxiang [1 ]
Affiliations
[1] China Telecom Corp Ltd, Data&AI Technol Co, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023
Keywords
Video-Text Retrieval; Mask Video Modeling; Attention
DOI
10.1145/3581783.3611756
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, masked video modeling has been widely explored and has improved models' local-level understanding of visual regions. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT), built on semantics-based masked modeling. Specifically, after applying attention-based video masking to generate high-informed and low-informed masks, we propose Informed Semantics Completion to recover masked semantic information. The recovery mechanism aligns the masked content with the unmasked visual regions and the corresponding textual context, which lets the model capture more text-related details at the patch level. Additionally, we shift the emphasis of reconstruction from irrelevant backgrounds to discriminative parts, ignoring regions under low-informed masks. Furthermore, we design a co-learning scheme that incorporates video cues under different masks to learn more aligned representations. MASCOT achieves state-of-the-art performance on four text-video retrieval benchmarks: MSR-VTT, LSMDC, ActivityNet, and DiDeMo.
Pages: 3847-3856
Page count: 10