PhaseDCN: A Phase-Enhanced Dual-Path Dilated Convolutional Network for Single-Channel Speech Enhancement

Cited by: 20
Authors
Zhang, Lu [1]
Wang, Mingjiang [1]
Zhang, Qiquan [1]
Wang, Xinsheng [2]
Liu, Ming [3]
Affiliations
[1] Harbin Inst Technol, Dept Elect & Informat Engn, Shenzhen 518000, Peoples R China
[2] Harbin Inst Technol, Dept Informat Sci & Engn, Weihai 264200, Peoples R China
[3] Shenzhen Inst Informat Technol, Shenzhen 518116, Peoples R China
Keywords
Speech enhancement; Noise measurement; Convolution; Feature extraction; Signal to noise ratio; Neural networks; Interference; complex spectrum; dilated convolutional network; multi-scale; multi-target learning; NEURAL-NETWORKS; NOISE; SEPARATION;
DOI
10.1109/TASLP.2021.3092585
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Recent deep neural network (DNN) based single-channel speech enhancement methods have achieved remarkable results in the time-frequency (TF) magnitude domain. To further improve the quality and intelligibility of enhanced speech, phase enhancement is also receiving increasing attention. In this paper, we propose a novel dilated convolutional network (DCN) model that enhances the magnitude and phase of noisy speech simultaneously. Unlike direct complex spectral mapping methods, we take the complex spectrum of the signal as the main target and the ideal ratio mask (IRM) as the auxiliary target in a multi-target learning framework, so that the two targets complement each other. First, a feature extraction module is introduced to fuse local and long-term features. The two targets are learned in separate paths but share this common feature extraction module, which helps to extract more general and suitable features. During joint learning, the intermediate estimate of the IRM target in the auxiliary path serves as a set of attention gating factors that help distinguish speech from non-speech components of the complex-valued signals in the main path. To leverage more fine-grained long-term contextual information, we introduce a multi-scale dilated convolution approach for feature encoding. Moreover, the proposed model is a causal system and can therefore meet the low-latency requirements of real-time speech products. Experimental results show that, compared with other advanced systems, the proposed model not only achieves better speech denoising performance and phase estimation accuracy, but also generalizes better under speaker, noise, and channel mismatch.
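The building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, kernel weights, and dilation rates are hypothetical, the IRM uses the common square-root definition (the paper may use a variant), and the gating is applied directly to complex TF bins instead of to learned features inside a network.

```python
import math


def irm(speech_mag, noise_mag):
    """Ideal ratio mask for one TF bin, common form sqrt(S^2 / (S^2 + N^2)).

    Assumption: exponent 0.5; the abstract does not state the exact variant.
    """
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return math.sqrt(s2 / (s2 + n2)) if (s2 + n2) > 0 else 0.0


def causal_dilated_conv1d(x, kernel, dilation):
    """Causal dilated convolution: y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zero, so y[t] never depends on the future.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def multi_scale(x, kernel, dilations=(1, 2, 4)):
    """Sum parallel causal branches that differ only in dilation rate.

    The dilation rates here are illustrative, not taken from the paper.
    """
    branches = [causal_dilated_conv1d(x, kernel, d) for d in dilations]
    return [sum(vals) for vals in zip(*branches)]


def gate_complex(spec, mask):
    """Scale each complex TF bin by a real-valued gate in [0, 1]."""
    return [m * z for z, m in zip(spec, mask)]


if __name__ == "__main__":
    # One frame with two TF bins: speech-dominated, then noise-only.
    noisy = [1 + 1j, 2 - 2j]
    mask = [irm(3.0, 1.0), irm(0.0, 1.0)]
    print(gate_complex(noisy, mask))
    # Impulse response of a two-branch multi-scale block.
    print(multi_scale([1, 0, 0, 0, 0], [1, 1], dilations=(1, 2)))
```

Because each output sample depends only on current and past inputs, a block of this kind can run frame by frame, which is the property the abstract relies on for low-latency use.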
Pages: 2561-2574
Page count: 14