TACOTRON-BASED ACOUSTIC MODEL USING PHONEME ALIGNMENT FOR PRACTICAL NEURAL TEXT-TO-SPEECH SYSTEMS

Times Cited: 0
Authors
Okamoto, Takuma [1 ]
Toda, Tomoki [1 ,2 ]
Shiga, Yoshinori [1 ]
Kawai, Hisashi [1 ]
Affiliations
[1] Natl Inst Informat & Commun Technol, Tokyo, Japan
[2] Nagoya Univ, Informat Technol Ctr, Nagoya, Aichi, Japan
Source
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019
Keywords
Speech synthesis; neural text-to-speech; duration model; forced alignment; sequence-to-sequence model; ATTENTION
DOI
10.1109/asru46091.2019.9003956
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Although sequence-to-sequence (seq2seq) models with an attention mechanism in neural text-to-speech (TTS) systems, such as Tacotron 2, can jointly optimize the duration and acoustic models and achieve higher-fidelity synthesis than conventional duration-acoustic pipeline models, they carry the risk that speech samples sometimes cannot be successfully synthesized because of attention prediction errors. Consequently, these seq2seq models cannot be directly introduced into practical TTS systems. Conventional pipeline models, by contrast, are widely used in practical TTS systems because their duration models produce few critical prediction errors. To realize high-quality practical TTS systems free of attention prediction errors, this paper investigates Tacotron-based acoustic models that use phoneme alignment instead of attention. The phoneme durations are first obtained from HMM-based forced alignment, and the duration model is a simple bidirectional LSTM-based network. A seq2seq model with forced alignment instead of attention is then investigated, and an alternative model combining the Tacotron decoder with phoneme durations is proposed. Experiments with full-context label input and the WaveGlow vocoder indicate that, unlike the seq2seq models, the proposed model realizes a high-fidelity Japanese TTS system with no attention prediction errors and a real-time factor of 0.13 on a GPU.
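For intuition only, the following is a minimal sketch of the core idea the abstract describes: replacing attention with forced alignment by upsampling phoneme-level encoder outputs according to their aligned frame durations before a Tacotron-style decoder consumes them. This is not the authors' implementation; the library choice (PyTorch), function name, and tensor shapes are illustrative assumptions.

# Minimal sketch of duration-based upsampling replacing attention.
# Assumptions (not from the paper): PyTorch; phoneme-level encoder
# outputs of shape (num_phonemes, hidden_dim); integer frame counts
# per phoneme obtained from HMM-based forced alignment.
import torch

def upsample_by_duration(encoder_out, durations):
    # Repeat each phoneme encoding once per aligned acoustic frame,
    # yielding a frame-level sequence for a Tacotron-style decoder.
    return torch.repeat_interleave(encoder_out, durations, dim=0)

# Toy usage: 3 phonemes aligned to 2, 4, and 3 frames respectively.
enc = torch.randn(3, 8)                  # hypothetical hidden_dim = 8
dur = torch.tensor([2, 4, 3])            # frame counts from forced alignment
frames = upsample_by_duration(enc, dur)  # shape: (9, 8)
print(frames.shape)

Because the frame-level alignment is fixed by the durations rather than predicted by an attention module, this step removes the attention mechanism and, with it, the attention prediction errors the abstract identifies as the obstacle to practical deployment.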
Pages: 214-221
Number of Pages: 8