Improving generative adversarial networks for speech enhancement through regularization of latent representations

Citations: 13
Authors
Yang, Fan [1, 2]
Wang, Ziteng [1, 2]
Li, Junfeng [1, 2]
Xia, Risheng [1]
Yan, Yonghong [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Chinese Acad Sci, Xinjiang Tech Inst Phys & Chem, Xinjiang Lab Minor Speech & Language Informat Pro, Beijing, Peoples R China
Keywords
Generative adversarial networks; End-to-end speech enhancement; Speech enhancement under low resources; NOISE;
DOI
10.1016/j.specom.2020.02.001
Chinese Library Classification number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
Speech enhancement aims to improve the quality and intelligibility of speech signals, which is a challenging task in adverse environments. The speech enhancement generative adversarial network (SEGAN), which adopted a generative adversarial network (GAN) for speech enhancement, achieved promising results. In this paper, a new network architecture and loss function based on SEGAN are proposed for speech enhancement. Unlike most network structures applied in this field, the new network, called high-level GAN (HLGAN), takes parallel noisy and clean speech signals as input in the training phase instead of only noisy speech signals, which makes full use of the information carried by the clean speech signals. Additionally, we introduce a new supervised speech representation loss, also called the high-level loss, at the middle hidden layer of the generative network. The high-level loss benefits HLGAN in speech enhancement under low signal-to-noise ratio (SNR) and low-resource conditions. We evaluate HLGAN over a wide range of experiments, in which our model produces significant improvements. Extensive experiments further demonstrate the generality of our model in a variety of speech enhancement cases. The tendency of SEGAN to lose speech components while removing noise in low-SNR environments is mitigated. In addition, HLGAN can effectively enhance the speech signals of two low-resource languages simultaneously. The reasons for the superior performance of HLGAN are discussed.
Pages: 1 / 9
Page count: 9