SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

Cited by: 3
Authors
Lin, Yuanze [1 ]
Wei, Chen [2 ]
Wang, Huiyu [2 ]
Yuille, Alan [2 ]
Xie, Cihang [3 ]
Affiliations
[1] Univ Washington, Seattle, WA 98195 USA
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
[3] UC Santa Cruz, Santa Cruz, CA USA
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV | 2023
Keywords
DOI
10.1109/ICCV51070.2023.00233
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video-language pre-training is crucial for learning powerful multi-modal representations, but it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. Its foundation is the masked autoencoder. Unlike prior works that mask only textual inputs, our masking strategy covers both the visual and textual modalities, yielding better cross-modal alignment and further reducing pre-training cost. On top of that, we introduce a space-time token sparsification module, which leverages context information to select only "important" spatial regions and temporal frames for pre-training. Coupling these designs lets our method attain competitive performance on text-to-video retrieval and video question answering while cutting pre-training cost by 1.9x or more. For example, SMAUG needs only about 50 NVIDIA A6000 GPU hours of pre-training to reach competitive performance on these two video-language tasks across six popular benchmarks.
Pages: 2459-2469
Page count: 11
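
For intuition only, below is a minimal sketch of the two ideas the abstract describes: masking both video patch tokens and text tokens, and sparsifying space-time tokens by keeping only the highest-scoring frames and spatial regions before the expensive encoders run. This is not the authors' released implementation; every function name, tensor shape, and masking ratio here is an illustrative assumption, and the context-derived importance scores are stubbed with random values.

# Minimal, illustrative sketch (NOT the released SMAUG code): dual-modality
# masking plus space-time token sparsification, with random importance scores
# standing in for the context-derived scores described in the abstract.
import torch


def random_mask(tokens, mask_ratio):
    # Keep a random (1 - mask_ratio) subset of tokens along dim 1.
    B, N, D = tokens.shape
    num_keep = max(1, int(N * (1 - mask_ratio)))
    noise = torch.rand(B, N, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]                  # (B, num_keep)
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx


def sparsify_space_time(video_tokens, scores, top_frames, top_tokens):
    # Keep the highest-scoring frames, then the highest-scoring spatial
    # tokens inside each kept frame. `scores` is (B, T, N) per-token importance.
    B, T, N, D = video_tokens.shape
    frame_idx = scores.mean(-1).topk(top_frames, dim=1).indices    # (B, top_frames)
    frames = torch.gather(
        video_tokens, 1, frame_idx[:, :, None, None].expand(-1, -1, N, D))
    frame_scores = torch.gather(
        scores, 1, frame_idx[:, :, None].expand(-1, -1, N))
    token_idx = frame_scores.topk(top_tokens, dim=-1).indices      # (B, top_frames, top_tokens)
    tokens = torch.gather(
        frames, 2, token_idx[..., None].expand(-1, -1, -1, D))
    return tokens.flatten(1, 2)                                    # (B, top_frames*top_tokens, D)


if __name__ == "__main__":
    B, T, N, D = 2, 8, 196, 768                   # batch, frames, patches per frame, dim
    video = torch.randn(B, T, N, D)               # patch embeddings of a video clip
    text = torch.randn(B, 32, D)                  # word embeddings of the caption
    scores = torch.rand(B, T, N)                  # stand-in for context-based importance
    sparse_video = sparsify_space_time(video, scores, top_frames=4, top_tokens=98)
    visible_video, _ = random_mask(sparse_video, mask_ratio=0.75)  # MAE-style visual masking
    visible_text, _ = random_mask(text, mask_ratio=0.15)           # textual masking
    print(visible_video.shape, visible_text.shape)                 # only these feed the encoders

In the actual method, the importance scores come from cross-modal context rather than random numbers, and the masked tokens are reconstructed by decoders during pre-training; the sketch only shows how the visible token set shrinks before encoding, which is where the claimed 1.9x-or-more cost saving would come from.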