Modality-Agnostic Self-Supervised Learning with Meta-Learned Masked Auto-Encoder

Cited by: 0
Authors
Jang, Huiwon [1 ]
Tack, Jihoon [1 ]
Choi, Daewon [2 ]
Jeong, Jongheon [1 ]
Shin, Jinwoo [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol KAIST, Daejeon, South Korea
[2] Korea Univ, Seoul, South Korea
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite its practical importance across a wide range of modalities, recent advances in self-supervised learning (SSL) have focused primarily on a few well-curated domains, e.g., vision and language, often relying on domain-specific knowledge. For example, the Masked Auto-Encoder (MAE) has become one of the popular architectures in these domains, but its potential in other modalities has been less explored. In this paper, we develop MAE as a unified, modality-agnostic SSL framework. Specifically, we argue that meta-learning is a key to interpreting MAE as a modality-agnostic learner, and from this motivation we propose enhancements to MAE that jointly improve its SSL across diverse modalities, coined MetaMAE. Our key idea is to view the mask reconstruction of MAE as a meta-learning task: masked tokens are predicted by adapting the Transformer meta-learner through the amortization of unmasked tokens. Based on this novel interpretation, we propose to integrate two advanced meta-learning techniques. First, we adapt the amortized latent of the Transformer encoder using gradient-based meta-learning to enhance the reconstruction. Then, we maximize the alignment between the amortized and adapted latents through task contrastive learning, which guides the Transformer encoder to better encode task-specific knowledge. Our experiments demonstrate the superiority of MetaMAE on the modality-agnostic SSL benchmark DABS, where it significantly outperforms prior baselines. Code is available at https://github.com/alinlab/MetaMAE.
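The abstract's two components — gradient-based adaptation of the amortized latent, and task-contrastive alignment between the amortized and adapted latents — can be sketched as a single training step. This is a minimal illustrative sketch, not the authors' implementation: the function name `metamae_step`, the hyperparameters `inner_lr`, `inner_steps`, and `tau`, and the use of plain MSE reconstruction and in-batch InfoNCE are all assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def metamae_step(encoder, decoder, tokens, mask, inner_lr=0.1, inner_steps=1, tau=0.1):
    """One hedged sketch of a MetaMAE-style update.

    tokens: (B, T, C) input tokens; mask: (B, T) with 1 = visible, 0 = masked.
    """
    # Amortized latent: encode only the visible (unmasked) tokens.
    z = encoder(tokens * mask.unsqueeze(-1))                      # (B, D)

    # Inner loop: adapt the latent by gradient descent on the
    # reconstruction loss over the visible tokens (gradient-based meta-learning).
    z_adapt = z
    for _ in range(inner_steps):
        recon = decoder(z_adapt)                                  # (B, T, C)
        loss_in = ((recon - tokens) ** 2 * mask.unsqueeze(-1)).mean()
        (grad,) = torch.autograd.grad(loss_in, z_adapt, create_graph=True)
        z_adapt = z_adapt - inner_lr * grad

    # Outer objective 1: reconstruct the *masked* tokens from the adapted latent.
    recon = decoder(z_adapt)
    loss_rec = ((recon - tokens) ** 2 * (1 - mask).unsqueeze(-1)).mean()

    # Outer objective 2: task-contrastive alignment between the amortized
    # and adapted latents (in-batch InfoNCE over the batch dimension).
    za = F.normalize(z, dim=-1)
    zb = F.normalize(z_adapt, dim=-1)
    logits = za @ zb.t() / tau
    labels = torch.arange(z.size(0))
    loss_con = F.cross_entropy(logits, labels)

    return loss_rec + loss_con
```

Because the inner-loop gradient is taken with `create_graph=True`, backpropagating the returned loss differentiates through the adaptation step, so the encoder learns latents that adapt well — the usual gradient-based meta-learning setup.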
Pages: 19