MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

Cited by: 33
Authors
Baade, Alan [1]
Peng, Puyuan [1]
Harwath, David [1]
Affiliations
[1] Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, USA
Source
INTERSPEECH 2022 | 2022
Keywords
audio classification; self-attention; Transformer; self-supervised
DOI
10.21437/Interspeech.2022-10961
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders Are Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates only on unmasked input and a shallow decoder operates on encoder outputs together with mask tokens. We find that MAE-like pretraining provides a 3x speedup and a 2x reduction in memory usage over the vanilla SSAST under current audio pretraining strategies with ordinary model and input sizes. When fine-tuning, which uses only the encoder, our approach outperforms the SSAST on a variety of downstream tasks. We further conduct comprehensive evaluations of different pretraining strategies and explore how MAE-style pretraining differs between the visual and audio domains. Code is available at https://github.com/AlanBaade/MAEAST-Public
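To make the encoder-decoder split described above concrete, the following is a minimal PyTorch sketch of MAE-style pretraining on spectrogram patches: a deep encoder runs only on the visible (unmasked) tokens, and a shallow decoder runs on the encoder outputs plus learned mask tokens. All module names and hyperparameters are illustrative assumptions, not the authors' released code, and details such as positional embeddings are omitted; see the linked repository for the actual implementation.

    # Hypothetical, simplified sketch of MAE-style pretraining; not the
    # authors' code. Positional embeddings and other details are omitted.
    import torch
    import torch.nn as nn

    class MAEStyleSketch(nn.Module):
        def __init__(self, dim=768, enc_layers=12, dec_layers=2, nhead=12, patch_dim=256):
            super().__init__()
            self.embed = nn.Linear(patch_dim, dim)  # spectrogram patch -> token
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead, batch_first=True), enc_layers)
            self.decoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead, batch_first=True), dec_layers)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.head = nn.Linear(dim, patch_dim)  # reconstruct patch contents

        def forward(self, patches, mask_ratio=0.75):
            B, N, _ = patches.shape
            x = self.embed(patches)
            d = x.size(-1)
            # Randomly keep (1 - mask_ratio) of the patches per example.
            n_keep = int(N * (1 - mask_ratio))
            keep_idx = torch.rand(B, N, device=x.device).argsort(1)[:, :n_keep]
            gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)
            x_vis = torch.gather(x, 1, gather_idx)
            # The deep encoder sees only the visible 25% of tokens; this is
            # where the compute and memory savings over the SSAST come from.
            latent = self.encoder(x_vis)
            # The shallow decoder sees encoder outputs scattered back into
            # place, with mask tokens at all masked positions.
            full = self.mask_token.expand(B, N, d).clone()
            full.scatter_(1, gather_idx, latent)
            pred = self.head(self.decoder(full))
            # Reconstruction loss (MSE) on the masked positions only.
            is_masked = torch.ones(B, N, dtype=torch.bool, device=x.device)
            is_masked.scatter_(1, keep_idx, False)
            return ((pred - patches) ** 2)[is_masked].mean()

    # Example: loss = MAEStyleSketch()(torch.randn(4, 512, 256))

At fine-tuning time only the encoder is kept, so the decoder's cost is paid exclusively during pretraining.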
Pages: 2438-2442
Page count: 5
References
22 in total
  • [1] Baevski A., et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. 2020.
  • [2] Brown T. B., et al. Language Models are Few-Shot Learners. Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.
  • [3] Busso C., Bulut M., Lee C.-C., Kazemzadeh A., Mower E., Kim S., Chang J. N., Lee S., Narayanan S. S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
  • [4] Devlin J., et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, Vol. 1: 4171.
  • [5] Dosovitskiy A., et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations, 2020.
  • [6] Gemmeke J. F., et al. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017: 776.
  • [7] Gong Y., et al. SSAST: Self-Supervised Audio Spectrogram Transformer. arXiv:2110.09784, 2021.
  • [8] Gong Y., Chung Y.-A., Glass J. AST: Audio Spectrogram Transformer. INTERSPEECH 2021: 571-575.
  • [9] He K., et al. Masked Autoencoders Are Scalable Vision Learners. 2021.
  • [10] Hsu W.-N., Tsai Y.-H. H., Bolte B., Salakhutdinov R., Mohamed A. HuBERT: How Much Can a Bad Teacher Benefit ASR Pre-Training? 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021): 6533-6537.