Multi-View Masked Autoencoder for General Image Representation

Cited by: 1
Authors
Ji, Seungbin [1]
Han, Sangkwon [1]
Rhee, Jongtae [1]
Affiliations
[1] Dongguk Univ, Dept Ind & Syst Engn, Seoul 04620, South Korea
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 22
Keywords
contrastive learning; deep learning; image representation learning; masked image modeling; self-supervised learning
DOI
10.3390/app132212413
CLC Number
O6 [Chemistry]
Subject Classification Code
0703
Abstract
Self-supervised learning learns general representations from unlabeled data. Masked image modeling (MIM), a generative self-supervised learning method, has drawn attention for its state-of-the-art performance on various downstream tasks, yet its token-level approach yields poor linear separability. In this paper, we propose a contrastive learning-based multi-view masked autoencoder for MIM that takes an image-level approach by learning common features from two differently augmented views. We strengthen MIM by learning long-range global patterns through a contrastive loss. Our framework adopts a simple encoder-decoder architecture and learns rich, general representations through a simple process: (1) two different views are generated from an input image by random masking, and a contrastive loss teaches the encoder the semantic distance between the representations of the two views; a high mask ratio of 80% acts as strong augmentation and alleviates the representation collapse problem; (2) with a reconstruction loss, the decoder learns to reconstruct the original image from each masked view. We evaluated our framework in experiments on benchmark datasets for image classification, object detection, and semantic segmentation. We achieved 84.3% fine-tuning accuracy and 76.7% linear-probing accuracy on ImageNet-1K classification, exceeding previous studies, and obtained promising results on the other downstream tasks. These results demonstrate that applying a contrastive loss to masked image modeling yields rich and general image representations.
Pages: 15
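To make the training objective described in the abstract concrete, here is a minimal PyTorch sketch, not the authors' code: two independently masked views of the same image pass through a shared encoder, an InfoNCE-style contrastive loss aligns their pooled image-level features, and a light decoder reconstructs the masked patches from each view. The architecture sizes, the projection head, the InfoNCE formulation, and treating the 80% random masking itself as the view-generating augmentation are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewMAE(nn.Module):
    """Toy multi-view masked autoencoder: two independently masked views of
    the same image share an encoder; InfoNCE aligns the views' pooled
    (image-level) features, and a decoder reconstructs masked patches."""

    def __init__(self, img_size=224, patch=16, dim=256, depth=4,
                 mask_ratio=0.8, temp=0.1):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        self.mask_ratio = mask_ratio          # 80% masking, as in the paper
        self.temp = temp
        self.embed = nn.Linear(3 * patch * patch, dim)     # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)
        dec = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, 2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, 128))     # contrastive head

    def patchify(self, imgs):
        # (B, 3, H, W) -> (B, N, 3 * patch * patch)
        B, C, _, _ = imgs.shape
        p = self.patch
        x = imgs.unfold(2, p, p).unfold(3, p, p)           # B, C, h, w, p, p
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def encode_view(self, patches):
        # keep a random 20% of the patches; masking is the augmentation here
        B, N, _ = patches.shape
        keep = int(N * (1 - self.mask_ratio))
        ids = torch.rand(B, N, device=patches.device).argsort(1)[:, :keep]
        x = self.embed(patches) + self.pos
        x = torch.gather(x, 1, ids.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return self.encoder(x), ids

    def info_nce(self, z1, z2):
        # symmetric InfoNCE: matching views in the batch are positives
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / self.temp
        labels = torch.arange(z1.size(0), device=z1.device)
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))

    def forward(self, imgs):
        patches = self.patchify(imgs)          # reconstruction target
        B, N = patches.size(0), self.num_patches
        rec_losses, feats = [], []
        for _ in range(2):                     # two masked views
            enc, ids = self.encode_view(patches)
            feats.append(self.proj(enc.mean(dim=1)))   # image-level feature
            # scatter visible tokens back; masked slots get the mask token
            full = self.mask_token.expand(B, N, enc.size(-1)).clone()
            full.scatter_(1, ids.unsqueeze(-1).expand(-1, -1, enc.size(-1)),
                          enc)
            pred = self.to_pixels(self.decoder(full + self.pos))
            hidden = torch.ones(B, N, device=imgs.device)
            hidden.scatter_(1, ids, 0.0)       # 1 where the patch was masked
            err = ((pred - patches) ** 2).mean(dim=-1)  # per-patch MSE
            rec_losses.append((err * hidden).sum() / hidden.sum())
        return sum(rec_losses) / 2 + self.info_nce(feats[0], feats[1])

# usage: one self-supervised training step on a random batch
model = MultiViewMAE()
loss = model(torch.randn(2, 3, 224, 224))
loss.backward()
```

Computing the reconstruction loss only on masked patches and the contrastive loss on pooled features mirrors the abstract's two-step description; the equal weighting of the two losses is another assumption.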