MambaGAN: Mamba based Metric GAN for Monaural Speech Enhancement

Cited by: 1
Authors
Luo, Tianhao [1 ,2 ,3 ]
Zhou, Feng [1 ,2 ,3 ,4 ]
Bai, Zhongxin [1 ,2 ,3 ]
Affiliations
[1] Harbin Engn Univ, Natl Key Lab Underwater Acoust Technol, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, Minist Ind & Informat Technol, Key Lab Marine Informat Acquisit & Secur, Harbin 150001, Peoples R China
[3] Harbin Engn Univ, Coll Underwater Acoust Engn, Harbin 150001, Peoples R China
[4] Harbin Engn Univ, Sanya Nanhai Innovat & Dev Base, Sanya 572024, Peoples R China
Source
2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024 | 2024
Keywords
Mamba; Omni-dimensional Dynamic Convolution; perceptual contrast stretching; speech enhancement;
DOI
10.1109/IALP63756.2024.10661187
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Encoder-decoder structures are widely used in deep neural network-based speech enhancement (SE) and are often built from convolutional and transformer modules. However, the high computational demands of transformers often limit real-time performance. To address this issue, we propose MambaGAN, a novel speech enhancement network that combines MambaFormer and ODConv within a GAN framework. MambaFormer is a Mamba-based structure that replaces the transformer in SE networks. In addition, Omni-dimensional Dynamic Convolution (ODConv) is introduced in place of the standard convolutional modules to capture richer speech features more flexibly. Experimental results on the VoiceBank+DEMAND dataset show that MambaGAN achieves an impressive PESQ score of 3.56. Combined with perceptual contrast stretching, it achieves a new state-of-the-art PESQ score of 3.72 while exhibiting lower computational complexity than existing conformer-based models.
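
For context on the dynamic-convolution idea behind ODConv, a minimal self-contained PyTorch sketch follows. It is an illustrative simplification, not the paper's implementation: it keeps only kernel-wise, input-channel, and output-channel attention (omitting the spatial and filter attentions of full ODConv), and every name in it (SimpleODConv2d, num_kernels, reduction, ...) is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleODConv2d(nn.Module):
    """Simplified dynamic convolution: K candidate kernels are mixed per input
    sample using attention over the kernel, input-channel, and output-channel
    dimensions (full ODConv also attends over the kernel's spatial dimension)."""

    def __init__(self, in_ch, out_ch, k=3, num_kernels=4, reduction=4):
        super().__init__()
        self.k = k
        # K candidate convolution kernels, each of shape (out_ch, in_ch, k, k).
        self.weight = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        hidden = max(in_ch // reduction, 4)
        self.pool = nn.AdaptiveAvgPool2d(1)                 # global context vector
        self.fc = nn.Linear(in_ch, hidden)
        self.attn_kernel = nn.Linear(hidden, num_kernels)   # kernel-wise attention
        self.attn_in = nn.Linear(hidden, in_ch)             # input-channel attention
        self.attn_out = nn.Linear(hidden, out_ch)           # output-channel attention

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = F.relu(self.fc(self.pool(x).flatten(1)))       # (b, hidden)
        a_k = torch.softmax(self.attn_kernel(ctx), dim=-1)   # (b, K)
        a_in = torch.sigmoid(self.attn_in(ctx))              # (b, in_ch)
        a_out = torch.sigmoid(self.attn_out(ctx))            # (b, out_ch)
        # Mix the K candidate kernels per sample, then rescale the channel dimensions.
        w_mix = torch.einsum('bk,koihw->boihw', a_k, self.weight)
        w_mix = w_mix * a_out.view(b, -1, 1, 1, 1) * a_in.view(b, 1, -1, 1, 1)
        # Apply one kernel per sample via the grouped-convolution trick.
        x = x.reshape(1, b * c, h, w)
        w_mix = w_mix.reshape(-1, c, self.k, self.k)          # (b*out_ch, in_ch, k, k)
        y = F.conv2d(x, w_mix, padding=self.k // 2, groups=b)
        return y.reshape(b, -1, h, w)

# Example: a (batch, channels, freq, time) spectrogram-like tensor.
out = SimpleODConv2d(in_ch=64, out_ch=64)(torch.randn(2, 64, 257, 100))
print(out.shape)  # torch.Size([2, 64, 257, 100])

The per-sample kernels are applied with a single grouped convolution over a batch-flattened input, which avoids looping over the batch; this is a common implementation trick for dynamic convolutions, not necessarily the one used in the paper.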
Pages: 411-416
Number of pages: 6