DeConformer-SENet: An efficient deformable conformer speech enhancement network

Cited: 0
Authors
Li, Man [1 ]
Liu, Ya [1 ]
Zhou, Li [1 ]
Affiliations
[1] Hubei Univ Chinese Med, Sch Foreign Languages, Wuhan, Peoples R China
Keywords
Monaural speech enhancement; Conformer; T-F-C self-attention; Deformable convolution; CONVOLUTIONAL RECURRENT NETWORKS; NEURAL-NETWORK; NOISE; SUPPRESSION; QUALITY;
DOI
10.1016/j.dsp.2024.104787
CLC classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline codes
0808; 0809;
Abstract
The Conformer model has demonstrated superior performance in speech enhancement by combining the long-range relationship modeling capability of self-attention with the local information processing ability of convolutional neural networks (CNNs). However, existing Conformer-based speech enhancement models struggle to balance performance and model complexity. In this work, we propose DeConformer-SENet, an end-to-end time-domain deformable Conformer speech enhancement model with modifications to both the self-attention and CNN components. First, we introduce the time-frequency-channel self-attention (TFC-SA) module, which compresses the information along each dimension of the input features into a one-dimensional vector. By computing the energy distribution, this module models long-range relationships across all three dimensions, reducing computational complexity while maintaining performance. Second, we replace standard convolutions with deformable convolutions to expand the receptive field of the CNN and model local features more accurately. We validate the proposed DeConformer-SENet on the WSJ0-SI84 + DNS Challenge dataset. Experimental results demonstrate that DeConformer-SENet outperforms existing Conformer and Transformer models in terms of ESTOI and PESQ while being more computationally efficient. Furthermore, ablation studies confirm that the proposed modifications improve the performance of the conventional Conformer and reduce model complexity without compromising overall effectiveness.
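The abstract describes TFC-SA as compressing each dimension of a time-frequency-channel feature tensor into a one-dimensional vector and using its energy distribution as attention. The paper's exact formulation is not given here, so the following is a minimal pure-Python sketch under assumed details: mean-pooling as the compression, a softmax over each pooled descriptor as the "energy distribution", and elementwise gating by the product of the three per-axis weight vectors. The function names (`tfc_attention`, `softmax`) are illustrative, not the paper's.

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def tfc_attention(x):
    """Hedged sketch of time-frequency-channel self-attention.

    x: nested lists of shape (T, F, C). Each axis is compressed into a
    1-D descriptor by mean-pooling over the other two axes (an assumed
    compression); a softmax over each descriptor stands in for the
    paper's "energy distribution", and the input is rescaled by the
    product of the three per-axis weights.
    """
    T, F, C = len(x), len(x[0]), len(x[0][0])
    # Per-axis 1-D descriptors: mean over the other two dimensions.
    t_desc = [sum(x[t][f][c] for f in range(F) for c in range(C)) / (F * C)
              for t in range(T)]
    f_desc = [sum(x[t][f][c] for t in range(T) for c in range(C)) / (T * C)
              for f in range(F)]
    c_desc = [sum(x[t][f][c] for t in range(T) for f in range(F)) / (T * F)
              for c in range(C)]
    wt, wf, wc = softmax(t_desc), softmax(f_desc), softmax(c_desc)
    # Gate the input by the product of the three 1-D attention vectors.
    return [[[x[t][f][c] * wt[t] * wf[f] * wc[c]
              for c in range(C)] for f in range(F)] for t in range(T)]
```

This structure illustrates the complexity argument in the abstract: only O(T + F + C) attention weights are computed, instead of the O((T·F·C)²) pairwise scores that full self-attention over the flattened tensor would require.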
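The second modification replaces standard convolutions with deformable convolutions (Dai et al., 2017), which let each kernel tap sample the input at a learned offset from its grid position, enlarging the receptive field. The sketch below shows the 1-D sampling idea only, with offsets passed in directly rather than predicted by an offset layer as in the real technique; fractional positions are handled by linear interpolation. The function name `deformable_conv1d` is illustrative.

```python
import math

def deformable_conv1d(signal, weights, offsets):
    """Sketch of 1-D deformable convolution sampling.

    Kernel tap k at output position i reads the input at position
    i + k + offsets[i][k] instead of the fixed grid point i + k.
    Fractional positions are resolved by linear interpolation with
    zero padding outside the signal. In a full model the offsets
    would be predicted by a small convolution over the input.
    """
    def sample(pos):
        lo = math.floor(pos)
        frac = pos - lo
        def at(j):
            return signal[j] if 0 <= j < len(signal) else 0.0
        return (1 - frac) * at(lo) + frac * at(lo + 1)

    K = len(weights)
    return [sum(weights[k] * sample(i + k + offsets[i][k]) for k in range(K))
            for i in range(len(signal) - K + 1)]
```

With all offsets set to zero this reduces exactly to a standard (valid-mode) convolution, which is what makes the deformable version a strict generalization of the CNN it replaces.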
Pages: 10