AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Cited by: 11
Authors
Bandara, Wele Gedara Chaminda [1 ]
Patel, Naman [2 ]
Gholami, Ali [2 ]
Nikkhah, Mehdi [2 ]
Agrawal, Motilal [2 ]
Patel, Vishal M. [1 ]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Zippin, San Francisco, CA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01394
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch-, tube-, or frame-based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from regions with high spatiotemporal information, thereby allowing us to mask 95% of tokens, resulting in lower memory requirements and faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach and report state-of-the-art top-1 accuracies of 70.0% on SSv2 and 81.7% on Kinetics-400 action classification with a ViT-Base backbone and 800 pre-training epochs. Code and pre-trained models are available at: https://github.com/wgcban/adamae.git.
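The sampling procedure the abstract describes — an auxiliary network scoring every space-time patch token, a categorical distribution over those scores, visible tokens drawn from it, and a REINFORCE-style loss that rewards tokens in high-reconstruction-error regions — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the toy token count of 1568 (a 16-frame clip with 2-frame tubes and a 14x14 patch grid is one plausible ViT-Base layout), and the use of random logits in place of a real sampling network are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)


def adaptive_mask(logits, mask_ratio=0.95):
    """Sample visible token indices from a categorical distribution.

    logits: (N,) per-token scores; in AdaMAE these would come from
    the auxiliary sampling network, here they are random stand-ins.
    Returns the sampled visible indices and the softmax probabilities.
    """
    n = logits.shape[0]
    n_visible = int(round(n * (1.0 - mask_ratio)))  # 95% masked -> 5% visible
    probs = np.exp(logits - logits.max())           # stable softmax
    probs = probs / probs.sum()
    # Draw without replacement, favoring high-probability tokens.
    visible = rng.choice(n, size=n_visible, replace=False, p=probs)
    return visible, probs


def sampling_loss(probs, visible, recon_err):
    """REINFORCE-style objective for the sampling network.

    recon_err: (N,) per-token reconstruction error, treated as a fixed
    reward signal (i.e., detached from the autoencoder's gradients).
    Minimizing this raises the probability of tokens whose regions
    were hard to reconstruct.
    """
    return float(-np.sum(np.log(probs[visible]) * recon_err[visible]))


# Toy usage: 1568 space-time tokens, random scores and errors.
logits = rng.standard_normal(1568)
visible, probs = adaptive_mask(logits)          # 78 visible tokens at 95% masking
recon_err = rng.random(1568)
loss = sampling_loss(probs, visible, recon_err)
```

In the paper's formulation the reward comes from the expected reconstruction error, so the sampling network and the MAE can be trained jointly end to end; the sketch above only shows the shape of that objective, with a dummy error vector standing in for the decoder's output.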
Pages: 14507-14517 (11 pages)