MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

Cited by: 33
Authors
Baade, Alan [1]
Peng, Puyuan [1]
Harwath, David [1]
Affiliations
[1] Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, USA
Source
INTERSPEECH 2022 | 2022
Keywords
audio classification; self-attention; Transformer; self-supervised
DOI
10.21437/Interspeech.2022-10961
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders Are Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates only on unmasked input and a shallow decoder operates on encoder outputs together with mask tokens. We find that MAE-like pretraining provides a 3x speedup and a 2x reduction in memory usage over the vanilla SSAST under current audio pretraining strategies with ordinary model and input sizes. When fine-tuning, which uses only the encoder, our approach outperforms the SSAST on a variety of downstream tasks. We further conduct comprehensive evaluations of different pretraining strategies and explore how MAE-style pretraining differs between the visual and audio domains. Code is available at https://github.com/AlanBaade/MAEAST-Public
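To make the encoder-decoder split described above concrete, the following is a minimal PyTorch sketch of MAE-style pretraining on spectrogram patches: a deep encoder runs only on the visible (unmasked) tokens, and a shallow decoder runs on the encoder outputs plus learned mask tokens. All module names and hyperparameters are illustrative assumptions, not the authors' released code, and details such as positional embeddings are omitted; see the linked repository for the actual implementation.

    # Hypothetical, simplified sketch of MAE-style pretraining; not the
    # authors' code. Positional embeddings and other details are omitted.
    import torch
    import torch.nn as nn

    class MAEStyleSketch(nn.Module):
        def __init__(self, dim=768, enc_layers=12, dec_layers=2, nhead=12, patch_dim=256):
            super().__init__()
            self.embed = nn.Linear(patch_dim, dim)  # spectrogram patch -> token
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead, batch_first=True), enc_layers)
            self.decoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead, batch_first=True), dec_layers)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.head = nn.Linear(dim, patch_dim)  # reconstruct patch contents

        def forward(self, patches, mask_ratio=0.75):
            B, N, _ = patches.shape
            x = self.embed(patches)
            d = x.size(-1)
            # Randomly keep (1 - mask_ratio) of the patches per example.
            n_keep = int(N * (1 - mask_ratio))
            keep_idx = torch.rand(B, N, device=x.device).argsort(1)[:, :n_keep]
            gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)
            x_vis = torch.gather(x, 1, gather_idx)
            # The deep encoder sees only the visible 25% of tokens; this is
            # where the compute and memory savings over the SSAST come from.
            latent = self.encoder(x_vis)
            # The shallow decoder sees encoder outputs scattered back into
            # place, with mask tokens at all masked positions.
            full = self.mask_token.expand(B, N, d).clone()
            full.scatter_(1, gather_idx, latent)
            pred = self.head(self.decoder(full))
            # Reconstruction loss (MSE) on the masked positions only.
            is_masked = torch.ones(B, N, dtype=torch.bool, device=x.device)
            is_masked.scatter_(1, keep_idx, False)
            return ((pred - patches) ** 2)[is_masked].mean()

    # Example: loss = MAEStyleSketch()(torch.randn(4, 512, 256))

At fine-tuning time only the encoder is kept, so the decoder's cost is paid exclusively during pretraining.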
Pages: 2438-2442
Page count: 5
References
22 in total
  • [1] Baevski A., et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. 2020.
  • [2] Brown T. B., et al. Language Models are Few-Shot Learners. Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.
  • [3] Busso C., Bulut M., Lee C.-C., Kazemzadeh A., Mower E., Kim S., Chang J. N., Lee S., Narayanan S. S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
  • [4] Devlin J., et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019, Vol. 1: 4171.
  • [5] Dosovitskiy A., et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations, 2020.
  • [6] Gemmeke J. F., et al. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017: 776.
  • [7] Gong Y., et al. SSAST: Self-Supervised Audio Spectrogram Transformer. arXiv:2110.09784, 2021.
  • [8] Gong Y., Chung Y.-A., Glass J. AST: Audio Spectrogram Transformer. INTERSPEECH 2021: 571-575.
  • [9] He K., et al. Masked Autoencoders Are Scalable Vision Learners. 2021.
  • [10] Hsu W.-N., Tsai Y.-H. H., Bolte B., Salakhutdinov R., Mohamed A. HuBERT: How Much Can a Bad Teacher Benefit ASR Pre-Training? 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021): 6533-6537.