Channel and temporal-frequency attention UNet for monaural speech enhancement

Cited by: 0
Authors
Shiyun Xu
Zehua Zhang
Mingjiang Wang
Affiliation
[1] Harbin Institute of Technology, Key Laboratory for Key Technologies of IoT Terminals
Source
EURASIP Journal on Audio, Speech, and Music Processing | Volume 2023
Keywords
Speech enhancement; Neural network; Denoising; Dereverberation
DOI: not available
Abstract
The presence of noise and reverberation significantly impedes speech clarity and intelligibility. To mitigate these effects, numerous deep learning-based network models have been proposed for speech enhancement tasks aimed at improving speech quality. In this study, we propose a monaural speech enhancement model called the channel and temporal-frequency attention UNet (CTFUNet). CTFUNet takes the noisy spectrum as input and produces a complex ideal ratio mask (cIRM) as output. To improve the speech enhancement performance of CTFUNet, we employ multi-scale temporal-frequency processing to extract input speech spectrum features. We also utilize multi-conv head channel attention and residual channel attention to capture temporal-frequency and channel features. Moreover, we introduce the channel temporal-frequency skip connection to alleviate information loss between down-sampling and up-sampling. On the blind test set of the first Deep Noise Suppression (DNS) Challenge, our proposed CTFUNet achieves better denoising performance than both the challenge-winning models and more recent models. Furthermore, our model outperforms recent models such as Uformer and MTFAA in both denoising and dereverberation performance.
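The abstract describes a mask-based formulation: the network outputs a complex ideal ratio mask (cIRM), which is applied multiplicatively to the noisy complex spectrum. A minimal sketch of that final masking step, not the authors' code, using NumPy and hypothetical variable names:

```python
import numpy as np

def apply_cirm(noisy_stft, mask_real, mask_imag):
    """Apply a complex ideal ratio mask to a noisy complex spectrogram.

    noisy_stft:           complex array of shape (freq, time)
    mask_real, mask_imag: real and imaginary mask components, same shape

    The enhanced spectrum is the element-wise complex product
    S_hat = M * Y, where M = mask_real + j * mask_imag.
    """
    mask = mask_real + 1j * mask_imag
    return mask * noisy_stft  # element-wise complex multiplication

# Toy check: a mask of exactly 1 + 0j leaves the spectrum unchanged,
# while a mask of 0 removes everything.
y = np.array([[1 + 2j, 3 - 1j]])
s_hat = apply_cirm(y, np.ones(y.shape), np.zeros(y.shape))
```

The enhanced waveform would then be recovered with an inverse STFT; the network architecture (UNet encoder-decoder with the attention modules described above) is what estimates `mask_real` and `mask_imag` from the noisy input.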