MFT-CRN: Multi-scale Fourier Transform for Monaural Speech Enhancement

Cited by: 0

Authors:

Wang, Yulong [1]

Zhang, Xueliang [1]

Affiliations:

[1] Inner Mongolia University, College of Computer Science, Hohhot, People's Republic of China

Source:

INTERSPEECH 2023 | 2023

Keywords:

monaural speech enhancement; frequency domain; short-time Fourier transform; multi-scale fusion

DOI:

10.21437/Interspeech.2023-865

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Convolutional recurrent networks (CRNs), which combine a convolutional encoder-decoder (CED) structure with a recurrent structure, have shown promising results in monaural speech enhancement. However, the commonly used short-time Fourier transform fails to balance frequency and time resolution effectively, and that balance is crucial for accurate speech estimation. To address this issue, we propose MFT-CRN, a multi-scale short-time Fourier transform fusion model. We process the input speech signal through short-time Fourier transforms with different window functions, and add them layer by layer in the encoder and decoder of the network to achieve feature fusion with different window functions, effectively balancing frequency and temporal resolution. Comprehensive experiments on the WSJ0 dataset show that MFT-CRN significantly outperforms the method using only a single window function in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
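
The resolution trade-off that the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation and does not model the CRN or the layer-by-layer fusion; it only computes a naive STFT of the same signal at several window lengths and reports how frequency and time resolution move in opposite directions. The sampling rate, the window lengths (64, 128, 256), the half-window hop, the Hann window, and the helper names `hann` and `stft` are all assumptions chosen for illustration, and only the standard library is used to keep the example dependency-free.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(signal, win_len, hop):
    """Naive STFT: a list of frames, each holding win_len // 2 + 1 complex bins."""
    window = hann(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + i] * window[i] for i in range(win_len)]
        bins = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                for n in range(win_len))
            for k in range(win_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


SAMPLE_RATE = 16000  # assumed sampling rate, not taken from the record
# Test signal: a 1 kHz sine, 1024 samples long.
signal = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(1024)]

# Hypothetical window lengths; the hop is half the window in each case.
scales = {}
for win_len in (64, 128, 256):
    spec = stft(signal, win_len, win_len // 2)
    first = spec[0]
    scales[win_len] = {
        "frames": len(spec),                        # time steps available
        "bins": len(first),                         # frequency bins per frame
        "hz_per_bin": SAMPLE_RATE / win_len,        # frequency resolution
        "ms_per_hop": 1000 * (win_len // 2) / SAMPLE_RATE,  # time resolution
        "peak_bin": max(range(len(first)), key=lambda k: abs(first[k])),
    }
    print(win_len, scales[win_len])
```

Longer windows give finer frequency bins but fewer, coarser time steps, and shorter windows do the reverse. A multi-scale model would need to bring these differently shaped spectrograms to compatible dimensions before adding them; how MFT-CRN does that is not stated in this record.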

Citation

Pages: 1060-1064

Page count: 5
