SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

Cited by: 3
Authors
Lin, Yuanze [1 ]
Wei, Chen [2 ]
Wang, Huiyu [2 ]
Yuille, Alan [2 ]
Xie, Cihang [3 ]
Affiliations
[1] Univ Washington, Seattle, WA 98195 USA
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
[3] UC Santa Cruz, Santa Cruz, CA USA
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV | 2023
Keywords
DOI
10.1109/ICCV51070.2023.00233
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video-language pre-training is crucial for learning powerful multi-modal representations, but it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. Its foundation is the masked autoencoder. Unlike prior works that mask only textual inputs, our masking strategy covers both the visual and textual modalities, yielding better cross-modal alignment and further reducing pre-training cost. On top of that, we introduce a space-time token sparsification module, which leverages context information to select only "important" spatial regions and temporal frames for pre-training. Coupling these designs lets our method attain competitive performance on text-to-video retrieval and video question answering while cutting pre-training cost by 1.9x or more. For example, SMAUG needs only about 50 NVIDIA A6000 GPU hours of pre-training to reach competitive performance on these two video-language tasks across six popular benchmarks.
Pages: 2459-2469
Page count: 11
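
For intuition only, below is a minimal sketch of the two ideas the abstract describes: masking both video patch tokens and text tokens, and sparsifying space-time tokens by keeping only the highest-scoring frames and spatial regions before the expensive encoders run. This is not the authors' released implementation; every function name, tensor shape, and masking ratio here is an illustrative assumption, and the context-derived importance scores are stubbed with random values.

# Minimal, illustrative sketch (NOT the released SMAUG code): dual-modality
# masking plus space-time token sparsification, with random importance scores
# standing in for the context-derived scores described in the abstract.
import torch


def random_mask(tokens, mask_ratio):
    # Keep a random (1 - mask_ratio) subset of tokens along dim 1.
    B, N, D = tokens.shape
    num_keep = max(1, int(N * (1 - mask_ratio)))
    noise = torch.rand(B, N, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]                  # (B, num_keep)
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx


def sparsify_space_time(video_tokens, scores, top_frames, top_tokens):
    # Keep the highest-scoring frames, then the highest-scoring spatial
    # tokens inside each kept frame. `scores` is (B, T, N) per-token importance.
    B, T, N, D = video_tokens.shape
    frame_idx = scores.mean(-1).topk(top_frames, dim=1).indices    # (B, top_frames)
    frames = torch.gather(
        video_tokens, 1, frame_idx[:, :, None, None].expand(-1, -1, N, D))
    frame_scores = torch.gather(
        scores, 1, frame_idx[:, :, None].expand(-1, -1, N))
    token_idx = frame_scores.topk(top_tokens, dim=-1).indices      # (B, top_frames, top_tokens)
    tokens = torch.gather(
        frames, 2, token_idx[..., None].expand(-1, -1, -1, D))
    return tokens.flatten(1, 2)                                    # (B, top_frames*top_tokens, D)


if __name__ == "__main__":
    B, T, N, D = 2, 8, 196, 768                   # batch, frames, patches per frame, dim
    video = torch.randn(B, T, N, D)               # patch embeddings of a video clip
    text = torch.randn(B, 32, D)                  # word embeddings of the caption
    scores = torch.rand(B, T, N)                  # stand-in for context-based importance
    sparse_video = sparsify_space_time(video, scores, top_frames=4, top_tokens=98)
    visible_video, _ = random_mask(sparse_video, mask_ratio=0.75)  # MAE-style visual masking
    visible_text, _ = random_mask(text, mask_ratio=0.15)           # textual masking
    print(visible_video.shape, visible_text.shape)                 # only these feed the encoders

In the actual method, the importance scores come from cross-modal context rather than random numbers, and the masked tokens are reconstructed by decoders during pre-training; the sketch only shows how the visible token set shrinks before encoding, which is where the claimed 1.9x-or-more cost saving would come from.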