Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments

Cited by: 0
Authors
Chunxi Wang
Maoshen Jia
Xinfeng Zhang
Affiliations
[1] Faculty of Information Technology, Beijing University of Technology
Source
EURASIP Journal on Audio, Speech, and Music Processing, Volume 2023
Keywords
Speech separation; Deep learning; Speech enhancement; SISNR
DOI
Not available
Abstract
In recent years, the speaker-independent, single-channel speech separation problem has seen significant progress with the development of deep neural networks (DNNs). However, separating the speech of each speaker of interest from an environment that also contains the speech of other speakers, background noise, and room reverberation remains challenging. To address this problem, a speech separation method for noisy reverberant environments is proposed. Firstly, a time-domain, end-to-end network structure, the deep encoder/decoder dual-path neural network, is introduced for speech separation. Secondly, to prevent the model from falling into a local optimum during training, a loss function called the stretched optimal scale-invariant signal-to-noise ratio (SOSISNR) is proposed, inspired by the scale-invariant signal-to-noise ratio (SISNR). In addition, to make training better match the human auditory system, the loss is extended to a joint loss function based on short-time objective intelligibility (STOI). Thirdly, an alignment operation is proposed to reduce the influence of the time delay caused by reverberation on separation performance. Combining the above methods, subjective and objective evaluation metrics show that the proposed approach achieves better separation performance in complex sound field environments than the baseline methods.
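For reference, below is a minimal Python/NumPy sketch of the standard SISNR metric on which the proposed SOSISNR loss is built; the stretched-optimal variant and the STOI-based joint loss are specific to the paper and are not reproduced here. The function name si_snr and the eps stabilizer are illustrative choices, not part of the paper.

import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    # Standard scale-invariant signal-to-noise ratio (SISNR) in dB.
    # Zero-mean both signals so the metric ignores DC offset and overall scale.
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Project the estimate onto the reference to obtain the target component.
    s_target = (np.dot(estimate, reference) / (np.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    # Ratio of target energy to residual (noise/distortion) energy, in dB.
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

# Training with such a metric typically maximizes SISNR, i.e., minimizes -si_snr(estimate, reference).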