MFT-CRN: Multi-scale Fourier Transform for Monaural Speech Enhancement

Cited by: 0
Authors
Wang, Yulong [1]
Zhang, Xueliang [1]
Affiliations
[1] Inner Mongolia Univ, Coll Comp Sci, Hohhot, Peoples R China
Source
INTERSPEECH 2023 | 2023
Keywords
monaural speech enhancement; frequency domain; short-time Fourier transform; multi-scale fusion
DOI
10.21437/Interspeech.2023-865
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Convolutional recurrent networks (CRNs), which combine a convolutional encoder-decoder (CED) structure with a recurrent structure, have shown promising results in monaural speech enhancement. However, the commonly used short-time Fourier transform (STFT) with a single window function cannot balance frequency and time resolution effectively, and that balance is crucial for accurate speech estimation. To address this issue, we propose MFT-CRN, a multi-scale STFT fusion model. We process the input speech signal through STFTs with different window functions and add the resulting features layer by layer in the encoder and decoder of the network, fusing features from the different windows and thereby balancing frequency and temporal resolution. Comprehensive experiments on the WSJ0 dataset show that MFT-CRN significantly outperforms the same method using only a single window function in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
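The resolution trade-off that motivates the multi-scale input can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 16 kHz sample rate, the Hann window, the window lengths (256, 512, 1024), the hop of 128 samples, and the helper name `stft` are all assumptions chosen for illustration, and the layer-by-layer fusion inside the network is not reproduced here.

```python
import numpy as np

def stft(signal, win_length, hop_length):
    """Hann-windowed STFT (hypothetical helper).

    Returns a complex array of shape (n_frames, win_length // 2 + 1).
    """
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)

# Assumed setup: one second of a noisy 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
rng = np.random.default_rng(0)
t = np.arange(sample_rate) / sample_rate
noisy = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(sample_rate)

# Same hop for every scale, so the frame rates stay comparable.
hop_length = 128
specs = {w: stft(noisy, w, hop_length) for w in (256, 512, 1024)}

# For each window length: (frequency bin spacing in Hz, window duration in ms).
resolution = {w: (sample_rate / w, 1000.0 * w / sample_rate) for w in specs}

for w, spec in specs.items():
    bin_hz, win_ms = resolution[w]
    print(f"window={w}: shape={spec.shape}, bin={bin_hz} Hz, span={win_ms} ms")
```

A longer window gives finer frequency bins but spans more time per frame, while a shorter window does the opposite. This is the imbalance that a single-window STFT cannot resolve and that fusing several scales aims to address.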
Pages: 1060-1064
Number of pages: 5