Linear-Complexity Self-Supervised Learning for Speech Processing

Cited: 0
Authors
Zhang, Shucong [1 ]
Parcollet, Titouan [1 ]
van Dalen, Rogier [1 ]
Bhattacharya, Sourav [1 ]
Affiliations
[1] Samsung AI Ctr Cambridge, Cambridge, England
Source
INTERSPEECH 2024 | 2024
Keywords
self-supervised learning; efficient models
DOI
10.21437/Interspeech.2024-500
Abstract
Self-supervised learning (SSL) models usually require weeks of pre-training on dozens of high-end GPUs. These models typically use a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not yet been explored for SSL. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance on the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of a wav2vec 2.0 model by 18% and 23%, respectively, enabling the pre-training of a 155M-parameter wav2vec 2.0 model to finish within one week on 4 Tesla A100 GPUs. Code is available.
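The abstract contrasts quadratic-cost MHSA with the linear-complexity SummaryMixing context encoder. Below is a minimal PyTorch sketch of a SummaryMixing-style layer, assuming the published recipe of a per-frame local transform, a time-averaged summary vector, and a combiner network; the class name, layer sizes, and activation choices are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SummaryMixingSketch(nn.Module):
    """Illustrative SummaryMixing-style layer (not the authors' code).

    Each frame passes through a local transform; a single summary vector is
    the time average of a second transform, so cost grows linearly with the
    sequence length; a combiner merges local and summary representations.
    """

    def __init__(self, d_model: int, d_hidden: int = 512):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.summary = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.combine = nn.Sequential(nn.Linear(2 * d_hidden, d_model), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        local = self.local(x)                        # per-frame transform, O(T)
        summary = self.summary(x).mean(dim=1)        # one global summary, O(T)
        summary = summary.unsqueeze(1).expand_as(local)
        return self.combine(torch.cat([local, summary], dim=-1))


if __name__ == "__main__":
    layer = SummaryMixingSketch(d_model=256)
    out = layer(torch.randn(2, 100, 256))
    print(out.shape)  # torch.Size([2, 100, 256])
```

Because the only interaction across time steps is the mean over frames, both compute and memory scale linearly with the input length, unlike the T x T attention matrix in MHSA.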
Pages: 3480-3484
Page count: 5