Masked Autoencoders Are Scalable Vision Learners

Cited by: 4343
Authors
He, Kaiming [1]
Chen, Xinlei [1]
Xie, Saining [1]
Li, Yanghao [1]
Dollar, Piotr [1]
Girshick, Ross [1]
Affiliations
[1] Facebook AI Res FAIR, New York, NY 10003 USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.01553
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
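The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_masking` and `decoder_input`, the `MASK` placeholder standing in for a learned mask token, and the 196-patch example (a 14 x 14 grid) are all assumptions made for illustration. The sketch only shows how a 75% random mask leaves a small visible subset for the encoder, and how mask tokens could be reinserted at the hidden positions before decoding.

```python
import random

# Hypothetical placeholder standing in for a learned mask token.
MASK = "<mask>"

def random_masking(patches, mask_ratio=0.75, seed=None):
    """Pick a random visible subset of patches.

    Returns the visible patches and their original positions.
    """
    rng = random.Random(seed)
    num_keep = int(len(patches) * (1 - mask_ratio))
    # Sample positions without replacement; sort to keep the original order.
    keep_ids = sorted(rng.sample(range(len(patches)), num_keep))
    visible = [patches[i] for i in keep_ids]
    return visible, keep_ids

def decoder_input(latent, keep_ids, num_patches, mask_token=MASK):
    """Rebuild the full-length sequence: encoded tokens at visible
    positions, the mask token everywhere else."""
    tokens = [mask_token] * num_patches
    for token, i in zip(latent, keep_ids):
        tokens[i] = token
    return tokens

# Assumed example: 196 patches, each represented here by its index.
patches = list(range(196))
visible, keep_ids = random_masking(patches, mask_ratio=0.75, seed=0)

# An identity "encoder" is assumed here; a real encoder would transform
# only the visible patches, which is where the compute saving comes from.
latent = visible
full = decoder_input(latent, keep_ids, num_patches=len(patches))

print(len(visible))      # 49 visible patches out of 196
print(full.count(MASK))  # 147 masked positions
```

With a 75% mask ratio the encoder in this sketch sees only a quarter of the patches, which is consistent with the abstract's point that the encoder operates on the visible subset alone.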
Pages: 15979-15988
Page count: 10