MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

Cited by: 46
Authors
Baade, Alan [1 ]
Peng, Puyuan [1 ]
Harwath, David [1 ]
Affiliation
[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
Source
INTERSPEECH 2022 | 2022
Keywords
audio classification; self-attention; Transformer; self-supervised
DOI
10.21437/Interspeech.2022-10961
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders are Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates only on unmasked input, and a shallow decoder operates on encoder outputs and mask tokens. We find that MAE-like pretraining can provide a 3x speedup and 2x memory usage reduction over the vanilla SSAST using current audio pretraining strategies with ordinary model and input sizes. When fine-tuning, which uses only the encoder, our approach outperforms the SSAST on a variety of downstream tasks. We further conduct comprehensive evaluations of different pretraining strategies and explore differences in MAE-style pretraining between the visual and audio domains. Code at https://github.com/AlanBaade/MAEAST-Public
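The encoder-decoder split that the abstract describes can be sketched as follows. This is an illustrative toy, not the authors' implementation: the "deep encoder" and "shallow decoder" are stood in by placeholder transforms, and the function name and mask-token choice are assumptions. The point it demonstrates is that at a 75% masking ratio the encoder sees only 25% of the patches, so its quadratic self-attention cost over tokens shrinks by roughly 16x, while the decoder still receives a full-length sequence with mask tokens restored to the masked positions.

```python
import numpy as np

def mae_style_pretrain_step(spectrogram_patches, mask_ratio=0.75, rng=None):
    """Sketch of an MAE-style pretraining step: encode only unmasked
    patches, then hand the decoder encoder outputs plus mask tokens
    placed back at their original positions (hypothetical helper)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = spectrogram_patches.shape
    n_keep = int(n * (1 - mask_ratio))

    # Randomly choose which patches the encoder gets to see.
    perm = rng.permutation(n)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]

    # "Deep encoder" stand-in: runs on the unmasked subset only, so a
    # real Transformer's self-attention would operate on n_keep tokens.
    encoded = spectrogram_patches[keep_idx] * 1.0  # placeholder transform

    # "Shallow decoder" input: encoder outputs plus a learned mask token
    # (here just zeros), scattered back into the original patch order.
    mask_token = np.zeros(d)
    decoder_in = np.empty((n, d))
    decoder_in[keep_idx] = encoded
    decoder_in[mask_idx] = mask_token
    return decoder_in, keep_idx, mask_idx

patches = np.arange(16.0).reshape(8, 2)  # 8 spectrogram patches, dim 2
dec_in, keep, masked = mae_style_pretrain_step(patches)
# At mask_ratio=0.75 with 8 patches, the encoder processes only 2 of them.
```

In the real model the decoder reconstructs the masked patches and the loss is taken over those positions; at fine-tuning time the decoder is discarded and only the encoder is kept, which is why the speedup applies to pretraining.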
Pages: 2438-2442
Page count: 5