Masked Autoencoders Are Scalable Vision Learners

Cited by: 4343

Authors:

He, Kaiming [1]

Chen, Xinlei [1]

Xie, Saining [1]

Li, Yanghao [1]

Dollar, Piotr [1]

Girshick, Ross [1]

Affiliations:

[1] Facebook AI Res FAIR, New York, NY 10003 USA

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022

DOI:

10.1109/CVPR52688.2022.01553

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
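The masking step and the asymmetric encoder-decoder data flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (random_masking, add_mask_tokens), the shuffle-by-sorting-random-noise trick, and the 196-patch example (which assumes a 224x224 image cut into 16x16 patches) are all assumptions made for illustration. The encoder and decoder networks themselves are omitted.

```python
import numpy as np


def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches (hypothetical helper, not from the paper's code).

    patches: array of shape (num_patches, dim).
    Returns (visible, mask, ids_restore), where mask[i] == 1 means patch i is hidden.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))

    # Assumption: shuffle by sorting random noise; the first num_keep indices stay visible.
    noise = rng.random(num_patches)
    ids_shuffle = np.argsort(noise)
    ids_restore = np.argsort(ids_shuffle)  # inverse permutation
    ids_keep = ids_shuffle[:num_keep]

    visible = patches[ids_keep]  # only these would be fed to the encoder
    mask = np.ones(num_patches, dtype=np.int64)
    mask[ids_keep] = 0
    return visible, mask, ids_restore


def add_mask_tokens(latent, ids_restore, mask_token):
    """Build the decoder input: encoded visible patches plus mask tokens, in original order.

    latent: (num_keep, dim) encoder output for the visible patches.
    mask_token: (1, dim) shared placeholder vector for every hidden patch.
    """
    num_masked = ids_restore.shape[0] - latent.shape[0]
    tokens = np.concatenate([latent, np.tile(mask_token, (num_masked, 1))], axis=0)
    return tokens[ids_restore]  # undo the shuffle


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(196, 8))  # assumed: 14 x 14 patches, toy dim of 8
    visible, mask, ids_restore = random_masking(patches, 0.75, rng)
    decoder_in = add_mask_tokens(visible, ids_restore, np.zeros((1, 8)))
    print(visible.shape, int(mask.sum()), decoder_in.shape)
```

With 196 patches and a 75% masking ratio, the encoder in this sketch processes only 49 patches, while the decoder receives all 196 positions. This is consistent with the abstract's point that the expensive encoder runs on the visible subset only.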

Pages: 15979-15988

Page count: 10
