Fine-grained prosody modeling in neural speech synthesis using ToBI representation

Cited by: 7
Authors
Zou, Yuxiang [1]
Liu, Shichao [1]
Yin, Xiang [1]
Lin, Haopeng [1]
Wang, Chunfeng [1]
Zhang, Haoyu [1]
Ma, Zejun [1]
Affiliation
[1] Bytedance AI Lab, Beijing, People's Republic of China
Source
INTERSPEECH 2021 | 2021
Keywords
speech synthesis; ToBI representation; prosody modeling; intonation; stress and pause control;
DOI
10.21437/Interspeech.2021-883
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
Benefiting from the great development of deep learning, modern neural text-to-speech (TTS) models can generate speech indistinguishable from natural speech. However, the generated utterances often keep an average prosodic style of the database instead of having rich prosodic variation. For pitch-stressed languages, such as English, accurate intonation and stress are important for conveying semantic information. In this work, we propose a fine-grained prosody modeling method in neural speech synthesis with ToBI (Tones and Break Indices) representation. The proposed system consists of a text frontend for ToBI prediction and a Tacotron-based TTS module for prosody modeling. By introducing the ToBI representation, we can control the system to synthesize speech with accurate intonation and stress at syllable level. Compared with the two baselines (Tacotron and unsupervised method), experiments show that our model can generate more natural speech with more accurate prosody, as well as effectively control the stress, intonation, and pause of the speech.
Pages: 3146 - 3150
Number of pages: 5
Related papers
50 in total
  • [1] MULTI-SPEAKER EMOTIONAL SPEECH SYNTHESIS WITH FINE-GRAINED PROSODY MODELING
    Lu, Chunhui
    Wen, Xue
    Liu, Ruolan
    Chen, Xiao
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5729 - 5733
  • [2] ROBUST AND FINE-GRAINED PROSODY CONTROL OF END-TO-END SPEECH SYNTHESIS
    Lee, Younggun
    Kim, Taesu
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5911 - 5915
  • [3] Towards Fine-Grained Prosody Control for Voice Conversion
    Lian, Zheng
    Zhong, Rongxiu
    Wen, Zhengqi
    Liu, Bin
    Tao, Jianhua
    2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [4] Fine-grained pitch processing of music and speech in congenital amusia
    Tillmann, Barbara
    Rusconi, Elena
    Traube, Caroline
    Butterworth, Brian
    Umilta, Carlo
    Peretz, Isabelle
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2011, 130 (06) : 4089 - 4096
  • [5] Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement
    Tan, Daxin
    Lee, Tan
    INTERSPEECH 2021, 2021, : 4683 - 4687
  • [6] Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
    Jiang, Yuepeng
    Li, Tao
    Yang, Fengyu
    Xie, Lei
    Menge, Meng
    Wang, Yujun
    INTERSPEECH 2024, 2024, : 2300 - 2304
  • [7] MEASURING THE EFFECT OF LINGUISTIC RESOURCES ON PROSODY MODELING FOR SPEECH SYNTHESIS
    Rosenberg, Andrew
    Fernandez, Raul
    Ramabhadran, Bhuvana
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5114 - 5118
  • [8] SPEECH PROSODY CONTROL USING WEIGHTED NEURAL NETWORK ENSEMBLES
    Romsdorfer, Harald
    2009 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2009, : 299 - 304
  • [9] Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis
    Pan, Shifeng
    He, Lei
    INTERSPEECH 2021, 2021, : 4678 - 4682
  • [10] Comparison of chironomic stylization versus statistical modeling of prosody for expressive speech synthesis
    Evrard, Marc
    Delalez, Samuel
    d'Alessandro, Christophe
    Rilliard, Albert
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 3370 - 3374