On Loss Functions and Recurrency Training for GAN-based Speech Enhancement Systems

Cited by: 23
Authors
Zhang, Zhuohuang [1,2]
Deng, Chengyun [3 ]
Shen, Yi [1 ]
Williamson, Donald S. [2 ]
Sha, Yongtao [3 ]
Zhang, Yi [3 ]
Song, Hui [3 ]
Li, Xiangang [3 ]
Affiliations
[1] Indiana Univ, Dept Speech Language & Hearing Sci, Bloomington, IN 47405 USA
[2] Indiana Univ, Dept Comp Sci, Bloomington, IN 47405 USA
[3] Didi Chuxing, Beijing, Peoples R China
Source
INTERSPEECH 2020 | 2020
Keywords
speech enhancement; generative adversarial networks; convolutional recurrent neural network; neural network
DOI
10.21437/Interspeech.2020-1169
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Recent work has shown that it is feasible to use generative adversarial networks (GANs) for speech enhancement; however, these approaches have not been compared to state-of-the-art (SOTA) non-GAN-based approaches. Additionally, many loss functions have been proposed for GAN-based approaches, but they have not been adequately compared. In this study, we propose novel convolutional recurrent GAN (CRGAN) architectures for speech enhancement. Multiple loss functions are adopted to enable direct comparisons with other GAN-based systems. The benefits of including recurrent layers are also explored. Our results show that the proposed CRGAN model outperforms SOTA GAN-based models that use the same loss functions, and that it outperforms non-GAN-based systems, indicating the benefits of using a GAN for speech enhancement. Overall, the CRGAN model that combines an objective-metric loss function with the mean squared error (MSE) performs best across many evaluation metrics.
Pages: 3266-3270
Page count: 5
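
The best-performing configuration described in the abstract pairs an objective-metric adversarial loss with an MSE term on top of a convolutional recurrent generator. Below is a minimal PyTorch-style sketch of what such a setup could look like; the paper does not publish this code, and every name here (CRGenerator, generator_loss), layer size, and loss weight is an illustrative assumption, loosely following MetricGAN-style training in which the discriminator predicts an objective quality score such as PESQ.

import torch
import torch.nn as nn

class CRGenerator(nn.Module):
    # Hypothetical conv-recurrent generator: a 2-D conv encoder over
    # (frequency, time), a GRU bottleneck over time, and a transposed-conv
    # decoder that emits a sigmoid mask applied to the noisy magnitude
    # spectrogram. Layer sizes are assumptions, not the paper's values.
    def __init__(self, freq_bins=161, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, (3, 3), stride=(2, 1), padding=(1, 1)), nn.ELU(),
            nn.Conv2d(16, 32, (3, 3), stride=(2, 1), padding=(1, 1)), nn.ELU())
        f = (freq_bins + 1) // 2   # frequency bins after the first stride-2 conv
        f = (f + 1) // 2           # ... and after the second
        self.rnn = nn.GRU(32 * f, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 32 * f)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, (3, 3), stride=(2, 1), padding=(1, 1)), nn.ELU(),
            nn.ConvTranspose2d(16, 1, (3, 3), stride=(2, 1), padding=(1, 1)), nn.Sigmoid())

    def forward(self, noisy_mag):              # (batch, 1, freq, time)
        z = self.encoder(noisy_mag)            # (batch, 32, reduced freq, time)
        b, c, f, t = z.shape
        z = z.permute(0, 3, 1, 2).reshape(b, t, c * f)
        z, _ = self.rnn(z)                     # temporal modeling via GRU
        z = self.proj(z).reshape(b, t, c, f).permute(0, 2, 3, 1)
        mask = self.decoder(z)                 # mask in [0, 1], same shape as input
        return mask * noisy_mag

def generator_loss(metric_score, enhanced_mag, clean_mag, alpha=10.0):
    # Adversarial term pushes the discriminator's predicted quality score for
    # enhanced speech toward its maximum (1.0), MetricGAN-style; the MSE term
    # anchors the enhanced spectrum to the clean target. alpha is an assumed
    # weight, not a value reported in the paper.
    adv = torch.mean((metric_score - 1.0) ** 2)
    mse = torch.mean((enhanced_mag - clean_mag) ** 2)
    return adv + alpha * mse

# Usage sketch: enhance a batch of 2 magnitude spectrograms (161 bins, 100 frames).
g = CRGenerator()
noisy = torch.rand(2, 1, 161, 100)
enhanced = g(noisy)                            # -> shape (2, 1, 161, 100)

The GRU bottleneck is the kind of recurrent layer whose benefit the paper ablates; dropping self.rnn and self.proj would leave a purely convolutional baseline for comparison.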