MN-Net: Speech Enhancement Network via Modeling the Noise

Citations: 0
Authors
Hu, Ying [1]
Yang, Qin [1]
Wei, Wenbing [1]
Lin, Li [2]
He, Liang [3]
Ou, Zhijian [4]
Yang, Wenzhong [1]
Affiliations
[1] Xinjiang Univ, Sch Comp Sci & Technol, Key Lab Signal Detect & Proc, Urumqi 830049, Peoples R China
[2] Xinjiang Inst Elect Res Shares Co Ltd, Urumqi 830026, Peoples R China
[3] Tsinghua Univ, Dept Elect Engn, Beijing 100190, Peoples R China
[4] Tsinghua Univ, Beijing Natl Res Ctr Informat Sci & Technol, Dept Elect, Beijing 100084, Peoples R China
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2025 / Vol. 33
Keywords
Noise; Feature extraction; Spectrogram; Speech enhancement; Decoding; Kernel; Transformers; Noise reduction; Data mining; Training; modeling noise; noise decoder; magnitude spectrogram; phase spectrogram; Multi-Branch Feature Extractor (MBFE); Spatial Reconstruction Unit (SRU)
DOI
10.1109/TASLPRO.2025.3546819
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Current deep learning-based speech enhancement methods generally focus on extracting the target speech while neglecting to model the other sound sources in the mixture. These methods still cannot distinguish the target speech from the interference well. In this paper, we present a monaural speech enhancement network via Modeling the Noise (MN-Net), which includes a shared Encoder and three separate Decoders that model, in parallel, the magnitude and phase spectrograms of the target speech and the complex spectrogram of the noise. Specifically, we propose a Multi-Branch Feature Extractor (MBFE) module to capture richer contextual information in the mixture, and a Spatial Reconstruction Unit (SRU) to remove redundancy from the extracted features. We compared MN-Net with 18 classical speech enhancement methods on the VoiceBank+DEMAND dataset and with 9 on the DNS-Challenge dataset for the denoising task, and with 7 on the WHAMR! dataset for the simultaneous denoising and de-reverberation task. We also applied the MBFE module to two classical speech enhancement methods, DB-AIAT and CMGAN, replacing their DenseBlocks module. The results demonstrate that the MBFE module can boost their performance while keeping a smaller model size. A series of visualization analyses intuitively verifies that modeling the noise enables the network to distinguish the target speech from noise and other interference more accurately.
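The three decoder outputs named in the abstract (speech magnitude, speech phase, complex noise spectrogram) relate to the mixture through standard spectrogram identities. The sketch below illustrates only those identities on toy data; it is not the MN-Net implementation. The function names, the array shape, and the use of oracle values in place of decoder outputs are assumptions made for illustration, and the abstract does not say how the paper combines the three outputs.

```python
import numpy as np

def combine_magnitude_phase(magnitude, phase):
    """Build a complex spectrogram from a magnitude and a phase (radians)."""
    return magnitude * np.exp(1j * phase)

def residual_speech(mixture_spec, noise_spec):
    """Speech implied by subtracting a complex noise estimate from the mixture."""
    return mixture_spec - noise_spec

# Toy complex spectrograms with shape (frequency bins, frames); sizes are arbitrary.
rng = np.random.default_rng(0)
shape = (4, 5)
speech = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = speech + noise  # additive mixing in the STFT domain

# Hypothetical stand-ins for the three decoder outputs (oracle values, no network).
mag_hat = np.abs(speech)
phase_hat = np.angle(speech)
noise_hat = noise

speech_from_mag_phase = combine_magnitude_phase(mag_hat, phase_hat)
speech_from_noise = residual_speech(mixture, noise_hat)
```

With oracle inputs both routes recover the same speech spectrogram, which shows why a noise estimate carries information that complements the magnitude and phase estimates.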
Pages: 1208-1219
Page count: 12
Related papers
63 in total
[1]   CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement [J].
Abdulatif, Sherif ;
Cao, Ruizhe ;
Yang, Bin .
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 :2477-2493
[2]   AN INVESTIGATION OF INCORPORATING MAMBA FOR SPEECH ENHANCEMENT [J].
Chao, Rong ;
Cheng, Wen-Huang ;
La Quatra, Moreno ;
Siniscalchi, Sabato Marco ;
Yang, Chao-Han Huck ;
Fu, Szu-Wei ;
Tsao, Yu .
2024 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2024, :302-308
[3]   DPT-FSNET: DUAL-PATH TRANSFORMER BASED FULL-BAND AND SUB-BAND FUSION NETWORK FOR SPEECH ENHANCEMENT [J].
Dang, Feng ;
Chen, Hangting ;
Zhang, Pengyuan .
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, :6857-6861
[4]   Real Time Speech Enhancement in the Waveform Domain [J].
Defossez, Alexandre ;
Synnaeve, Gabriel ;
Adi, Yossi .
INTERSPEECH 2020, 2020, :3291-3295
[5]   SpecMNet: Spectrum mend network for monaural speech enhancement [J].
Fan, Cunhang ;
Zhang, Hongmei ;
Yi, Jiangyan ;
Lv, Zhao ;
Tao, Jianhua ;
Li, Taihao ;
Pei, Guanxiong ;
Wu, Xiaopei ;
Li, Sheng .
APPLIED ACOUSTICS, 2022, 194
[6]  
Fu SW, 2019, PR MACH LEARN RES, V97
[7]   MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement [J].
Fu, Szu-Wei ;
Yu, Cheng ;
Hsieh, Tsun-An ;
Plantinga, Peter ;
Ravanelli, Mirco ;
Lu, Xugang ;
Tsao, Yu .
INTERSPEECH 2021, 2021, :201-205
[8]  
Gemmeke JF, 2017, INT CONF ACOUST SPEE, P776, DOI 10.1109/ICASSP.2017.7952261
[9]   FULLSUBNET: A FULL-BAND AND SUB-BAND FUSION MODEL FOR REAL-TIME SINGLE-CHANNEL SPEECH ENHANCEMENT [J].
Hao, Xiang ;
Su, Xiangdong ;
Horaud, Radu ;
Li, Xiaofei .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6633-6637
[10]   Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification [J].
He, Kaiming ;
Zhang, Xiangyu ;
Ren, Shaoqing ;
Sun, Jian .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :1026-1034