ROBUST AND FINE-GRAINED PROSODY CONTROL OF END-TO-END SPEECH SYNTHESIS

Cited by: 0
Authors
Lee, Younggun [1]
Kim, Taesu [1]
Affiliations
[1] Neosapience Inc, Seoul, South Korea
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
Prosody; Speech style; Speech synthesis; Text-to-speech
DOI
10.1109/icassp.2019.8683501
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We propose prosody embeddings for emotional and expressive speech synthesis networks. The proposed methods introduce temporal structures in the embedding networks, thus enabling fine-grained control of the speaking style of the synthesized speech. The temporal structures can be designed either on the speech side or the text side, leading to different control resolutions in time. The prosody embedding networks are plugged into end-to-end speech synthesis networks and trained without any other supervision except for the target speech for synthesizing. It is demonstrated that the prosody embedding networks learned to extract prosodic features. By adjusting the learned prosody features, we could change the pitch and amplitude of the synthesized speech both at the frame level and the phoneme level. We also introduce the temporal normalization of prosody embeddings, which shows better robustness against speaker perturbations during prosody transfer tasks.
Pages: 5911-5915
Page count: 5