VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Cited by: 134
Authors
Wang, Limin [1 ,2 ]
Huang, Bingkun [1 ,2 ]
Zhao, Zhiyu [1 ,2 ]
Tong, Zhan [1 ]
He, Yinan [2 ]
Wang, Yi [2 ]
Wang, Yali [2 ,3 ]
Qiao, Yu [2 ,3 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Shanghai AI Lab, Shanghai, Peoples R China
[3] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
National Key R&D Program of China; National Natural Science Foundation of China
DOI
10.1109/CVPR52729.2023.01398
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Scale is the primary factor in building a powerful foundation model that generalizes well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on one subset of video tokens and a decoder processing another subset. Although VideoMAE is already efficient due to the high masking ratio in the encoder, masking the decoder input can further reduce the overall computational cost. This enables the efficient pre-training of billion-parameter models on video. We also use a progressive training paradigm that involves initial pre-training on a diverse multi-sourced unlabeled dataset, followed by post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on the Kinetics datasets (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners.
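
For intuition, below is a minimal PyTorch sketch of the dual-masking idea described in the abstract; it is not the authors' implementation. The class name, layer sizes, and masking ratios are illustrative assumptions, and random index selection stands in for the structured masks used in the paper (tube masking for the encoder, running-cell masking for the decoder). The point it illustrates: the encoder runs only on the visible tokens, and the decoder reconstructs only a subset of the masked tokens.

import torch
import torch.nn as nn

class DualMaskedAutoencoder(nn.Module):
    """Toy dual-masking sketch: ratios and modules are illustrative."""
    def __init__(self, dim=384, encoder_mask_ratio=0.9,
                 decoder_mask_ratio=0.5):
        super().__init__()
        self.encoder_mask_ratio = encoder_mask_ratio
        self.decoder_mask_ratio = decoder_mask_ratio
        # Stand-ins for the video ViT encoder and the shallow decoder.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True),
            num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, dim)  # toy reconstruction head

    def forward(self, tokens):
        # tokens: (batch, num_tokens, dim); positional embeddings
        # are omitted here for brevity.
        B, N, D = tokens.shape
        n_vis = int(N * (1 - self.encoder_mask_ratio))
        # Encoder masking: keep ~10% of tokens visible (random here;
        # the paper uses structured tube masking).
        perm = torch.randperm(N, device=tokens.device)
        vis_idx, masked_idx = perm[:n_vis], perm[n_vis:]
        latent = self.encoder(tokens[:, vis_idx])
        # Decoder masking: reconstruct only part of the masked
        # positions, the extra saving on top of encoder masking.
        n_dec = int(masked_idx.numel() * (1 - self.decoder_mask_ratio))
        dec_idx = masked_idx[:n_dec]
        mask_tokens = self.mask_token.expand(B, n_dec, -1)
        decoded = self.decoder(torch.cat([latent, mask_tokens], dim=1))
        pred = self.head(decoded[:, n_vis:])  # predictions at dec_idx only
        return pred, dec_idx

model = DualMaskedAutoencoder()
video_tokens = torch.randn(2, 1568, 384)  # e.g. 8x14x14 space-time tokens
pred, dec_idx = model(video_tokens)
print(pred.shape)  # torch.Size([2, 706, 384]) with the ratios above

Because both stages shrink the token sequence, compute scales with the kept tokens rather than the full video, which is what makes billion-parameter pre-training tractable.
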
Pages: 14549-14560 (12 pages)