MambaGAN: Mamba based Metric GAN for Monaural Speech Enhancement

Cited by: 1
Authors
Luo, Tianhao [1 ,2 ,3 ]
Zhou, Feng [1 ,2 ,3 ,4 ]
Bai, Zhongxin [1 ,2 ,3 ]
Affiliations
[1] Harbin Engn Univ, Natl Key Lab Underwater Acoust Technol, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, Minist Ind & Informat Technol, Key Lab Marine Informat Acquisit & Secur, Harbin 150001, Peoples R China
[3] Harbin Engn Univ, Coll Underwater Acoust Engn, Harbin 150001, Peoples R China
[4] Harbin Engn Univ, Sanya Nanhai Innovat & Dev Base, Sanya 572024, Peoples R China
Source
2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024 | 2024
Keywords
Mamba; Omni-dimensional Dynamic Convolution; perceptual contrast stretching; speech enhancement;
DOI
10.1109/IALP63756.2024.10661187
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Encoder-decoder structures are widely used in deep neural network-based speech enhancement (SE) and are often built from convolutional and transformer modules. However, the high computational demands of transformers often limit real-time performance. To address this issue, we propose MambaGAN, a novel speech enhancement network that combines MambaFormer and ODConv within a GAN framework. MambaFormer is a Mamba-based structure that replaces the transformer in SE networks. In addition, Omni-dimensional Dynamic Convolution (ODConv) is introduced in place of the standard convolutional modules to capture richer speech features more flexibly. Experimental results on the VoiceBank+DEMAND dataset show that MambaGAN achieves an impressive PESQ score of 3.56. Combined with perceptual contrast stretching, it achieves a new state-of-the-art PESQ score of 3.72 while exhibiting lower computational complexity than existing conformer-based models.
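
For context on the dynamic-convolution idea behind ODConv, a minimal self-contained PyTorch sketch follows. It is an illustrative simplification, not the paper's implementation: it keeps only kernel-wise, input-channel, and output-channel attention (omitting the spatial and filter attentions of full ODConv), and every name in it (SimpleODConv2d, num_kernels, reduction, ...) is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleODConv2d(nn.Module):
    """Simplified dynamic convolution: K candidate kernels are mixed per input
    sample using attention over the kernel, input-channel, and output-channel
    dimensions (full ODConv also attends over the kernel's spatial dimension)."""

    def __init__(self, in_ch, out_ch, k=3, num_kernels=4, reduction=4):
        super().__init__()
        self.k = k
        # K candidate convolution kernels, each of shape (out_ch, in_ch, k, k).
        self.weight = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        hidden = max(in_ch // reduction, 4)
        self.pool = nn.AdaptiveAvgPool2d(1)                 # global context vector
        self.fc = nn.Linear(in_ch, hidden)
        self.attn_kernel = nn.Linear(hidden, num_kernels)   # kernel-wise attention
        self.attn_in = nn.Linear(hidden, in_ch)             # input-channel attention
        self.attn_out = nn.Linear(hidden, out_ch)           # output-channel attention

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = F.relu(self.fc(self.pool(x).flatten(1)))       # (b, hidden)
        a_k = torch.softmax(self.attn_kernel(ctx), dim=-1)   # (b, K)
        a_in = torch.sigmoid(self.attn_in(ctx))              # (b, in_ch)
        a_out = torch.sigmoid(self.attn_out(ctx))            # (b, out_ch)
        # Mix the K candidate kernels per sample, then rescale the channel dimensions.
        w_mix = torch.einsum('bk,koihw->boihw', a_k, self.weight)
        w_mix = w_mix * a_out.view(b, -1, 1, 1, 1) * a_in.view(b, 1, -1, 1, 1)
        # Apply one kernel per sample via the grouped-convolution trick.
        x = x.reshape(1, b * c, h, w)
        w_mix = w_mix.reshape(-1, c, self.k, self.k)          # (b*out_ch, in_ch, k, k)
        y = F.conv2d(x, w_mix, padding=self.k // 2, groups=b)
        return y.reshape(b, -1, h, w)

# Example: a (batch, channels, freq, time) spectrogram-like tensor.
out = SimpleODConv2d(in_ch=64, out_ch=64)(torch.randn(2, 64, 257, 100))
print(out.shape)  # torch.Size([2, 64, 257, 100])

The per-sample kernels are applied with a single grouped convolution over a batch-flattened input, which avoids looping over the batch; this is a common implementation trick for dynamic convolutions, not necessarily the one used in the paper.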
Pages: 411-416
Number of pages: 6