Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

Cited by: 141
Authors
Saito, Yuki [1 ]
Takamichi, Shinnosuke [1 ]
Saruwatari, Hiroshi [1 ]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo 1138656, Japan
Keywords
Statistical parametric speech synthesis; text-to-speech synthesis; voice conversion; deep neural networks; generative adversarial networks; over-smoothing;
DOI
10.1109/TASLP.2017.2761547
Chinese Library Classification: O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network techniques can be applied to artificially synthesize speech waveforms, the quality of the synthetic speech is low compared with that of natural speech. One issue causing this quality degradation is an oversmoothing effect often observed in the generated speech parameters. A GAN introduced in this paper consists of two neural networks: a discriminator that distinguishes natural from generated samples, and a generator that deceives the discriminator. In the proposed framework incorporating GANs, the discriminator is trained to distinguish natural from generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation error (MGE) loss and an adversarial loss for deceiving the discriminator. Since the objective of the GAN is to minimize the divergence (i.e., the distribution difference) between the natural and generated speech parameters, the proposed method effectively alleviates the oversmoothing effect in the generated speech parameters. We evaluated its effectiveness for text-to-speech and voice conversion, and found that the proposed method generates more natural spectral parameters and F0 than the conventional minimum generation error training algorithm regardless of its hyperparameter settings. Furthermore, we investigated the effect of the divergences of various GANs, and found that a Wasserstein GAN minimizing the Earth mover's distance works best in terms of improving the synthetic speech quality.
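The weighted training objective described in the abstract (MGE loss plus a weighted adversarial term that rewards deceiving the discriminator) can be sketched as follows. This is an illustrative toy in NumPy, not the paper's implementation: the single linear "acoustic model" `W_gen`, the linear discriminator `w_disc`, the feature dimensions, and the weight `w_adv` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: linguistic features -> speech parameters, and a
# discriminator that scores speech parameters as natural vs. generated.
W_gen = rng.normal(size=(64, 40)) * 0.1   # "acoustic model" (illustrative)
w_disc = rng.normal(size=40) * 0.1        # "discriminator" (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generator_loss(ling, y_natural, w_adv=0.1):
    """Weighted sum of the MGE loss and an adversarial loss, as in the abstract."""
    y_gen = ling @ W_gen                       # generated speech parameters
    mge = np.mean((y_gen - y_natural) ** 2)    # minimum generation error term
    p_natural = sigmoid(y_gen @ w_disc)        # discriminator's "natural" probability
    adv = -np.mean(np.log(p_natural + 1e-12))  # low when the discriminator is deceived
    return mge + w_adv * adv

ling = rng.normal(size=(8, 64))     # batch of linguistic feature vectors
y_nat = rng.normal(size=(8, 40))    # corresponding natural speech parameters
loss = generator_loss(ling, y_nat)
```

Setting `w_adv=0` recovers plain MGE training; increasing it trades pointwise accuracy for matching the natural parameter distribution, which is how the method counteracts oversmoothing.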
Pages: 84-96 (13 pages)
Related Papers (50 total)
  • [1] Generative Adversarial Network-Based Postfilter for Statistical Parametric Speech Synthesis
    Kaneko, Takuhiro; Kameoka, Hirokazu; Hojo, Nobukatsu; Ijima, Yusuke; Hiramatsu, Kaoru; Kashino, Kunio
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017: 4910-4914
  • [2] Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under a Multi-Task Learning Framework
    Yang, Shan; Xie, Lei; Chen, Xiao; Lou, Xiaoyan; Zhu, Xuan; Huang, Dongyan; Li, Haizhou
    2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017: 685-691
  • [3] Generative Adversarial Network-Based Glottal Waveform Model for Statistical Parametric Speech Synthesis
    Bollepalli, Bajibabu; Juvela, Lauri; Alku, Paavo
    18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), 2017: 3394-3398
  • [4] A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis
    Chen, Ling-Hui; Raitio, Tuomo; Valentini-Botinhao, Cassia; Ling, Zhen-Hua; Yamagishi, Junichi
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(11): 2003-2014
  • [5] Speech Waveform Synthesis from MFCC Sequences with Generative Adversarial Networks
    Juvela, Lauri; Bollepalli, Bajibabu; Wang, Xin; Kameoka, Hirokazu; Airaksinen, Manu; Yamagishi, Junichi; Alku, Paavo
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 5679-5683
  • [6] The Effect of Neural Networks in Statistical Parametric Speech Synthesis
    Hashimoto, Kei; Oura, Keiichiro; Nankaku, Yoshihiko; Tokuda, Keiichi
    2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015: 4455-4459
  • [7] Generative Adversarial Networks for Speech Processing: A Review
    Wali, Aamir; Alamgir, Zareen; Karim, Saira; Fawaz, Ather; Ali, Mubariz Barkat; Adan, Muhammad; Mujtaba, Malik
    Computer Speech and Language, 2022, 72
  • [8] Speech Loss Compensation by Generative Adversarial Networks
    Shi, Yupeng; Zheng, Nengheng; Kang, Yuyong; Rong, Weicong
    2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019: 347-351
  • [9] Statistical Parametric Speech Synthesis Using Deep Neural Networks
    Zen, Heiga; Senior, Andrew; Schuster, Mike
    2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013: 7962-7966
  • [10] Directly Modeling Speech Waveforms by Neural Networks for Statistical Parametric Speech Synthesis
    Tokuda, Keiichi; Zen, Heiga
    2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015: 4215-4219