Linear-Complexity Self-Supervised Learning for Speech Processing

Cited: 0
Authors
Zhang, Shucong [1 ]
Parcollet, Titouan [1 ]
van Dalen, Rogier [1 ]
Bhattacharya, Sourav [1 ]
Affiliations
[1] Samsung AI Ctr Cambridge, Cambridge, England
Source
INTERSPEECH 2024 | 2024
Keywords
self-supervised learning; efficient models
DOI
10.21437/Interspeech.2024-500
Abstract
Self-supervised learning (SSL) models usually require weeks of pre-training on dozens of high-end GPUs. These models typically use a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not yet been explored for SSL. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance on the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of a wav2vec 2.0 model by 18% and 23%, respectively, enabling the pre-training of a 155M-parameter wav2vec 2.0 model to finish within one week on 4 Tesla A100 GPUs. Code is available.
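The abstract contrasts quadratic-cost MHSA with the linear-complexity SummaryMixing context encoder. Below is a minimal PyTorch sketch of a SummaryMixing-style layer, assuming the published recipe of a per-frame local transform, a time-averaged summary vector, and a combiner network; the class name, layer sizes, and activation choices are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SummaryMixingSketch(nn.Module):
    """Illustrative SummaryMixing-style layer (not the authors' code).

    Each frame passes through a local transform; a single summary vector is
    the time average of a second transform, so cost grows linearly with the
    sequence length; a combiner merges local and summary representations.
    """

    def __init__(self, d_model: int, d_hidden: int = 512):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.summary = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.combine = nn.Sequential(nn.Linear(2 * d_hidden, d_model), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        local = self.local(x)                        # per-frame transform, O(T)
        summary = self.summary(x).mean(dim=1)        # one global summary, O(T)
        summary = summary.unsqueeze(1).expand_as(local)
        return self.combine(torch.cat([local, summary], dim=-1))


if __name__ == "__main__":
    layer = SummaryMixingSketch(d_model=256)
    out = layer(torch.randn(2, 100, 256))
    print(out.shape)  # torch.Size([2, 100, 256])
```

Because the only interaction across time steps is the mean over frames, both compute and memory scale linearly with the input length, unlike the T x T attention matrix in MHSA.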
Pages: 3480-3484
Page count: 5