Channel and temporal-frequency attention UNet for monaural speech enhancement

Cited by: 12

Authors:

Xu, Shiyun [1]

Zhang, Zehua [1]

Wang, Mingjiang [1]

Affiliations:

[1] Harbin Institute of Technology, Key Laboratory for Key Technologies of IoT Terminals, Shenzhen, People's Republic of China

Source:

EURASIP Journal on Audio, Speech, and Music Processing | 2023 / Vol. 2023 / Issue 1

Funding:

National Natural Science Foundation of China

Keywords:

Speech enhancement; Neural network; Denoising; Dereverberation; Self-attention; Intelligibility; Reverberant

DOI:

10.1186/s13636-023-00295-6

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Noise and reverberation significantly degrade speech clarity and intelligibility. To mitigate these effects, numerous deep learning-based models have been proposed for speech enhancement. In this study, we propose a monaural speech enhancement model called the channel and temporal-frequency attention UNet (CTFUNet). CTFUNet takes the noisy spectrum as input and produces a complex ideal ratio mask (cIRM) as output. To improve its enhancement performance, we employ multi-scale temporal-frequency processing to extract features from the input speech spectrum. We also use multi-conv head channel attention and residual channel attention to capture temporal-frequency and channel features. Moreover, we introduce a channel temporal-frequency skip connection to alleviate the information loss between down-sampling and up-sampling. On the blind test set of the first Deep Noise Suppression (DNS) Challenge, the proposed CTFUNet achieves better denoising performance than both the challenge's champion models and more recent models. Furthermore, it outperforms recent models such as Uformer and MTFAA in both denoising and dereverberation.
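To make the role of the cIRM concrete, the following is a minimal sketch of how a complex ratio mask relates a noisy spectrum to a clean one. It is not the CTFUNet implementation: the function names `cirm` and `apply_mask` and the toy bin values are assumptions for illustration, and the mask here is computed from a known clean signal rather than predicted by a network. Any mask compression used during training is also omitted.

```python
# Illustrative sketch only: function names and toy values are hypothetical,
# not taken from the paper. Each list entry stands for one time-frequency bin.

def cirm(clean, noisy, eps=1e-8):
    """Complex ideal ratio mask per bin: M = S * conj(Y) / |Y|^2."""
    masks = []
    for s, y in zip(clean, noisy):
        denom = y.real ** 2 + y.imag ** 2 + eps  # eps guards against empty bins
        m_real = (y.real * s.real + y.imag * s.imag) / denom
        m_imag = (y.real * s.imag - y.imag * s.real) / denom
        masks.append(complex(m_real, m_imag))
    return masks


def apply_mask(mask, noisy):
    """Enhance by complex multiplication, which adjusts magnitude and phase."""
    return [m * y for m, y in zip(mask, noisy)]


# Toy spectra (assumed values) for three time-frequency bins.
clean_bins = [1 + 2j, -0.5 + 0.25j, 3 - 1j]
noisy_bins = [1.5 + 1j, 0.2 - 0.7j, 2 + 2j]

ideal_mask = cirm(clean_bins, noisy_bins)
estimate = apply_mask(ideal_mask, noisy_bins)
```

With the ideal mask, `estimate` matches `clean_bins` up to the small `eps` term. A model such as the one described above would instead output an estimate of this mask from the noisy spectrum alone.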

Pages: 14
