AN INVESTIGATION OF INCORPORATING MAMBA FOR SPEECH ENHANCEMENT

Times Cited: 3
Authors
Chao, Rong [1 ,2 ]
Cheng, Wen-Huang [2 ]
La Quatra, Moreno [3 ]
Siniscalchi, Sabato Marco [4 ]
Yang, Chao-Han Huck [5 ]
Fu, Szu-Wei [5 ]
Tsao, Yu [1 ]
Affiliations
[1] Acad Sinica, Taipei, Taiwan
[2] Natl Taiwan Univ, Taipei, Taiwan
[3] Kore Univ Enna, Enna, Italy
[4] Univ Palermo, Palermo, Italy
[5] NVIDIA, Santa Clara, CA USA
Source
2024 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2024
Keywords
consistency loss; Mamba; speech enhancement; state-space machine; SEMamba;
DOI
10.1109/SLT61566.2024.10832332
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This work investigates the use of Mamba, a recently proposed, attention-free, scalable state-space model (SSM), for the speech enhancement (SE) task. In particular, we employ Mamba to build regression-based SE models (SEMamba) in different configurations, namely basic, advanced, causal, and non-causal. Furthermore, loss functions based either on signal-level distances or on evaluation-metric scores are considered. Experimental evidence shows that SEMamba attains a competitive PESQ of 3.55 on the VoiceBank-DEMAND dataset with the advanced, non-causal configuration. A new state-of-the-art PESQ of 3.69 is also reported when SEMamba is combined with Perceptual Contrast Stretching (PCS). Compared against equivalent Transformer-based SE solutions, a noticeable FLOPs reduction of up to ~12% is observed with the advanced non-causal configuration. Finally, SEMamba can be used as a pre-processing step before automatic speech recognition (ASR), showing competitive performance against recent SE solutions.
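The abstract describes SEMamba only at a high level. The sketch below is a hypothetical illustration, not the authors' released code: it shows how a Mamba layer from the open-source mamba_ssm package could be dropped into a regression-based SE model operating on magnitude-spectrogram frames, with the causal/non-causal distinction modeled by an optional time-reversed pass and a simple L1 magnitude distance standing in for the signal-level loss. Class and parameter names (MambaSEBlock, n_freq_bins, d_model) are illustrative assumptions.

```python
# Hypothetical sketch only: a Mamba-based regression SE block, not the
# authors' SEMamba implementation. Requires the `mamba-ssm` package and a
# CUDA device for its selective-scan kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F
from mamba_ssm import Mamba

class MambaSEBlock(nn.Module):
    def __init__(self, n_freq_bins: int = 257, d_model: int = 256, causal: bool = True):
        super().__init__()
        self.proj_in = nn.Linear(n_freq_bins, d_model)
        # Forward-in-time Mamba layer (attention-free state-space model).
        self.fwd = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        # Non-causal configuration: add a second pass over the time-reversed sequence.
        self.bwd = None if causal else Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.proj_out = nn.Linear(d_model, n_freq_bins)

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, time_frames, freq_bins) magnitude spectrogram.
        x = self.proj_in(noisy_mag)
        h = self.fwd(x)
        if self.bwd is not None:
            h = h + torch.flip(self.bwd(torch.flip(x, dims=[1])), dims=[1])
        # Regression-based enhancement: predict a bounded mask for the noisy magnitude.
        mask = torch.sigmoid(self.proj_out(h))
        return mask * noisy_mag

# Signal-level distance loss on magnitudes; metric-oriented objectives
# (e.g. MetricGAN-style PESQ estimators) would replace or complement this term.
def signal_level_loss(enhanced_mag: torch.Tensor, clean_mag: torch.Tensor) -> torch.Tensor:
    return F.l1_loss(enhanced_mag, clean_mag)
```

Swapping `causal=True` for `causal=False` mirrors the causal vs. non-causal configurations mentioned in the abstract; the bidirectional pass roughly doubles the state-space computation but removes the look-ahead restriction.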
Pages: 302-308
Page count: 7