DeConformer-SENet: An efficient deformable conformer speech enhancement network

Cited: 0
Authors
Li, Man [1 ]
Liu, Ya [1 ]
Zhou, Li [1 ]
Affiliations
[1] Hubei Univ Chinese Med, Sch Foreign Languages, Wuhan, Peoples R China
Keywords
Monaural speech enhancement; Conformer; T-F-C self-attention; Deformable convolution; CONVOLUTIONAL RECURRENT NETWORKS; NEURAL-NETWORK; NOISE; SUPPRESSION; QUALITY;
DOI
10.1016/j.dsp.2024.104787
CLC classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline codes
0808; 0809;
Abstract
The Conformer model has demonstrated superior performance in speech enhancement by combining the long-range relationship modeling capability of self-attention with the local information processing ability of convolutional neural networks (CNNs). However, existing Conformer-based speech enhancement models struggle to balance performance and model complexity. In this work, we propose DeConformer-SENet, an end-to-end time-domain deformable Conformer speech enhancement model with modifications to both the self-attention and CNN components. First, we introduce the time-frequency-channel self-attention (TFC-SA) module, which compresses the information along each dimension of the input features into a one-dimensional vector. By computing the energy distribution, this module models long-range relationships across all three dimensions, reducing computational complexity while maintaining performance. Second, we replace standard convolutions with deformable convolutions to expand the receptive field of the CNN and model local features more accurately. We validate the proposed DeConformer-SENet on the WSJ0-SI84 + DNS Challenge dataset. Experimental results demonstrate that DeConformer-SENet outperforms existing Conformer and Transformer models in terms of ESTOI and PESQ while being more computationally efficient. Furthermore, ablation studies confirm that the proposed modifications improve the performance of the conventional Conformer and reduce model complexity without compromising overall effectiveness.
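The abstract describes TFC-SA as compressing each dimension of a time-frequency-channel feature tensor into a one-dimensional vector and using its energy distribution as attention. The paper's exact formulation is not given here, so the following is a minimal pure-Python sketch under assumed details: mean-pooling as the compression, a softmax over each pooled descriptor as the "energy distribution", and elementwise gating by the product of the three per-axis weight vectors. The function names (`tfc_attention`, `softmax`) are illustrative, not the paper's.

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def tfc_attention(x):
    """Hedged sketch of time-frequency-channel self-attention.

    x: nested lists of shape (T, F, C). Each axis is compressed into a
    1-D descriptor by mean-pooling over the other two axes (an assumed
    compression); a softmax over each descriptor stands in for the
    paper's "energy distribution", and the input is rescaled by the
    product of the three per-axis weights.
    """
    T, F, C = len(x), len(x[0]), len(x[0][0])
    # Per-axis 1-D descriptors: mean over the other two dimensions.
    t_desc = [sum(x[t][f][c] for f in range(F) for c in range(C)) / (F * C)
              for t in range(T)]
    f_desc = [sum(x[t][f][c] for t in range(T) for c in range(C)) / (T * C)
              for f in range(F)]
    c_desc = [sum(x[t][f][c] for t in range(T) for f in range(F)) / (T * F)
              for c in range(C)]
    wt, wf, wc = softmax(t_desc), softmax(f_desc), softmax(c_desc)
    # Gate the input by the product of the three 1-D attention vectors.
    return [[[x[t][f][c] * wt[t] * wf[f] * wc[c]
              for c in range(C)] for f in range(F)] for t in range(T)]
```

This structure illustrates the complexity argument in the abstract: only O(T + F + C) attention weights are computed, instead of the O((T·F·C)²) pairwise scores that full self-attention over the flattened tensor would require.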
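The second modification replaces standard convolutions with deformable convolutions (Dai et al., 2017), which let each kernel tap sample the input at a learned offset from its grid position, enlarging the receptive field. The sketch below shows the 1-D sampling idea only, with offsets passed in directly rather than predicted by an offset layer as in the real technique; fractional positions are handled by linear interpolation. The function name `deformable_conv1d` is illustrative.

```python
import math

def deformable_conv1d(signal, weights, offsets):
    """Sketch of 1-D deformable convolution sampling.

    Kernel tap k at output position i reads the input at position
    i + k + offsets[i][k] instead of the fixed grid point i + k.
    Fractional positions are resolved by linear interpolation with
    zero padding outside the signal. In a full model the offsets
    would be predicted by a small convolution over the input.
    """
    def sample(pos):
        lo = math.floor(pos)
        frac = pos - lo
        def at(j):
            return signal[j] if 0 <= j < len(signal) else 0.0
        return (1 - frac) * at(lo) + frac * at(lo + 1)

    K = len(weights)
    return [sum(weights[k] * sample(i + k + offsets[i][k]) for k in range(K))
            for i in range(len(signal) - K + 1)]
```

With all offsets set to zero this reduces exactly to a standard (valid-mode) convolution, which is what makes the deformable version a strict generalization of the CNN it replaces.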
Pages: 10