Modality-Agnostic Self-Supervised Learning with Meta-Learned Masked Auto-Encoder

Cited by: 0
Authors
Jang, Huiwon [1 ]
Tack, Jihoon [1 ]
Choi, Daewon [2 ]
Jeong, Jongheon [1 ]
Shin, Jinwoo [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol KAIST, Daejeon, South Korea
[2] Korea Univ, Seoul, South Korea
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite its practical importance across a wide range of modalities, recent advances in self-supervised learning (SSL) have focused primarily on a few well-curated domains, e.g., vision and language, often relying on domain-specific knowledge. For example, the Masked Auto-Encoder (MAE) has become one of the popular architectures in these domains, but its potential in other modalities has been less explored. In this paper, we develop MAE as a unified, modality-agnostic SSL framework. Specifically, we argue that meta-learning is a key to interpreting MAE as a modality-agnostic learner, and from this motivation we propose enhancements to MAE that jointly improve its SSL across diverse modalities, coined MetaMAE. Our key idea is to view the mask reconstruction of MAE as a meta-learning task: masked tokens are predicted by adapting the Transformer meta-learner through the amortization of unmasked tokens. Based on this novel interpretation, we propose to integrate two advanced meta-learning techniques. First, we adapt the amortized latent of the Transformer encoder using gradient-based meta-learning to enhance the reconstruction. Then, we maximize the alignment between the amortized and adapted latents through task contrastive learning, which guides the Transformer encoder to better encode task-specific knowledge. Our experiments demonstrate the superiority of MetaMAE on the modality-agnostic SSL benchmark DABS, where it significantly outperforms prior baselines. Code is available at https://github.com/alinlab/MetaMAE.
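The abstract's two components — gradient-based adaptation of the amortized latent, and task-contrastive alignment between the amortized and adapted latents — can be sketched as a single training step. This is a minimal illustrative sketch, not the authors' implementation: the function name `metamae_step`, the hyperparameters `inner_lr`, `inner_steps`, and `tau`, and the use of plain MSE reconstruction and in-batch InfoNCE are all assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def metamae_step(encoder, decoder, tokens, mask, inner_lr=0.1, inner_steps=1, tau=0.1):
    """One hedged sketch of a MetaMAE-style update.

    tokens: (B, T, C) input tokens; mask: (B, T) with 1 = visible, 0 = masked.
    """
    # Amortized latent: encode only the visible (unmasked) tokens.
    z = encoder(tokens * mask.unsqueeze(-1))                      # (B, D)

    # Inner loop: adapt the latent by gradient descent on the
    # reconstruction loss over the visible tokens (gradient-based meta-learning).
    z_adapt = z
    for _ in range(inner_steps):
        recon = decoder(z_adapt)                                  # (B, T, C)
        loss_in = ((recon - tokens) ** 2 * mask.unsqueeze(-1)).mean()
        (grad,) = torch.autograd.grad(loss_in, z_adapt, create_graph=True)
        z_adapt = z_adapt - inner_lr * grad

    # Outer objective 1: reconstruct the *masked* tokens from the adapted latent.
    recon = decoder(z_adapt)
    loss_rec = ((recon - tokens) ** 2 * (1 - mask).unsqueeze(-1)).mean()

    # Outer objective 2: task-contrastive alignment between the amortized
    # and adapted latents (in-batch InfoNCE over the batch dimension).
    za = F.normalize(z, dim=-1)
    zb = F.normalize(z_adapt, dim=-1)
    logits = za @ zb.t() / tau
    labels = torch.arange(z.size(0))
    loss_con = F.cross_entropy(logits, labels)

    return loss_rec + loss_con
```

Because the inner-loop gradient is taken with `create_graph=True`, backpropagating the returned loss differentiates through the adaptation step, so the encoder learns latents that adapt well — the usual gradient-based meta-learning setup.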
Pages: 19