AN INVESTIGATION OF INCORPORATING MAMBA FOR SPEECH ENHANCEMENT

Times Cited: 3
Authors
Chao, Rong [1 ,2 ]
Cheng, Wen-Huang [2 ]
La Quatra, Moreno [3 ]
Siniscalchi, Sabato Marco [4 ]
Yang, Chao-Han Huck [5 ]
Fu, Szu-Wei [5 ]
Tsao, Yu [1 ]
Affiliations
[1] Acad Sinica, Taipei, Taiwan
[2] Natl Taiwan Univ, Taipei, Taiwan
[3] Kore Univ Enna, Enna, Italy
[4] Univ Palermo, Palermo, Italy
[5] NVIDIA, Santa Clara, CA USA
Source
2024 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2024
Keywords
consistency loss; Mamba; speech enhancement; state-space machine; SEMamba;
DOI
10.1109/SLT61566.2024.10832332
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This work investigates the use of Mamba, a recently proposed, attention-free, scalable state-space model (SSM), for the speech enhancement (SE) task. In particular, we employ Mamba to build regression-based SE models (SEMamba) in different configurations, namely basic, advanced, causal, and non-causal. Furthermore, loss functions based either on signal-level distances or on evaluation-metric scores are considered. Experimental evidence shows that SEMamba attains a competitive PESQ of 3.55 on the VoiceBank-DEMAND dataset with the advanced, non-causal configuration. A new state-of-the-art PESQ of 3.69 is also reported when SEMamba is combined with Perceptual Contrast Stretching (PCS). Compared against equivalent Transformer-based SE solutions, a noticeable FLOPs reduction of up to ~12% is observed with the advanced non-causal configuration. Finally, SEMamba can be used as a pre-processing step before automatic speech recognition (ASR), showing competitive performance against recent SE solutions.
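The abstract describes SEMamba only at a high level. The sketch below is a hypothetical illustration, not the authors' released code: it shows how a Mamba layer from the open-source mamba_ssm package could be dropped into a regression-based SE model operating on magnitude-spectrogram frames, with the causal/non-causal distinction modeled by an optional time-reversed pass and a simple L1 magnitude distance standing in for the signal-level loss. Class and parameter names (MambaSEBlock, n_freq_bins, d_model) are illustrative assumptions.

```python
# Hypothetical sketch only: a Mamba-based regression SE block, not the
# authors' SEMamba implementation. Requires the `mamba-ssm` package and a
# CUDA device for its selective-scan kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F
from mamba_ssm import Mamba

class MambaSEBlock(nn.Module):
    def __init__(self, n_freq_bins: int = 257, d_model: int = 256, causal: bool = True):
        super().__init__()
        self.proj_in = nn.Linear(n_freq_bins, d_model)
        # Forward-in-time Mamba layer (attention-free state-space model).
        self.fwd = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        # Non-causal configuration: add a second pass over the time-reversed sequence.
        self.bwd = None if causal else Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.proj_out = nn.Linear(d_model, n_freq_bins)

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, time_frames, freq_bins) magnitude spectrogram.
        x = self.proj_in(noisy_mag)
        h = self.fwd(x)
        if self.bwd is not None:
            h = h + torch.flip(self.bwd(torch.flip(x, dims=[1])), dims=[1])
        # Regression-based enhancement: predict a bounded mask for the noisy magnitude.
        mask = torch.sigmoid(self.proj_out(h))
        return mask * noisy_mag

# Signal-level distance loss on magnitudes; metric-oriented objectives
# (e.g. MetricGAN-style PESQ estimators) would replace or complement this term.
def signal_level_loss(enhanced_mag: torch.Tensor, clean_mag: torch.Tensor) -> torch.Tensor:
    return F.l1_loss(enhanced_mag, clean_mag)
```

Swapping `causal=True` for `causal=False` mirrors the causal vs. non-causal configurations mentioned in the abstract; the bidirectional pass roughly doubles the state-space computation but removes the look-ahead restriction.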
Pages: 302-308
Page count: 7