PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Cited by: 10
Authors
Karlapati, Sri [1]
Abbas, Ammar [1]
Hodari, Zack [2]
Moinet, Alexis [1]
Joly, Arnaud [1]
Karanasou, Penny [1]
Drugman, Thomas [1]
Affiliations
[1] Amazon Res, Cambridge, England
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
TTS; prosody modelling; contextual prosody
DOI
10.1109/ICASSP39728.2021.9413696
Chinese Library Classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
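The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the Kathaka architecture: the record gives no model details, so everything below is an assumption made for illustration. Stage I is stood in for by PCA over frame-averaged mel-spectrograms, Stage II by a linear map from context features to the latent, and the random `ctx` vectors are hypothetical placeholders for BERT and parse-tree features.

```python
import numpy as np

def stage1_fit_encoder(mels, z_dim):
    """Stage I stand-in (assumption): PCA over frame-averaged mel-spectrograms."""
    pooled = mels.mean(axis=1)                      # (n_sent, n_mels)
    mean = pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    return mean, vt[:z_dim].T                       # projection: (n_mels, z_dim)

def stage1_encode(mels, mean, proj):
    """Map each sentence's mel-spectrogram to one sentence-level latent vector."""
    return (mels.mean(axis=1) - mean) @ proj

def stage2_fit_sampler(ctx, z):
    """Stage II stand-in (assumption): linear map from text context to the latent."""
    x = np.hstack([ctx, np.ones((len(ctx), 1))])    # add a bias column
    w, *_ = np.linalg.lstsq(x, z, rcond=None)
    resid_std = (z - x @ w).std(axis=0)             # spread left unexplained by text
    return w, resid_std

def stage2_predict(ctx, w):
    """Predicted latent mean given text-context features only."""
    return np.hstack([ctx, np.ones((len(ctx), 1))]) @ w

# Synthetic data (hypothetical): prosody depends on text context through a latent.
rng = np.random.default_rng(0)
n_sent, n_frames, n_mels, ctx_dim, z_dim = 200, 50, 80, 16, 4
ctx = rng.normal(size=(n_sent, ctx_dim))
true_z = ctx @ rng.normal(size=(ctx_dim, z_dim))
mels = (true_z @ rng.normal(size=(z_dim, n_mels)))[:, None, :] \
    + 0.1 * rng.normal(size=(n_sent, n_frames, n_mels))

# Stage I: latents come from audio, which is only available during training.
mean, proj = stage1_fit_encoder(mels, z_dim)
z = stage1_encode(mels, mean, proj)

# Stage II: learn to pick a point in that latent space from text alone.
w, resid_std = stage2_fit_sampler(ctx, z)
mu_hat = stage2_predict(ctx, w)
samples = mu_hat + resid_std * rng.normal(size=mu_hat.shape)
```

At synthesis time only `ctx` would be available, so `stage2_predict` (optionally with added noise, as in `samples`) replaces the audio-derived latent `z`.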
Pages: 6573 - 6577
Page count: 5
Related papers
50 in total
  • [31] Development of robotic voice conversion for RIBO using text-to-speech synthesis
    Hossain, Md. Jakir
    Al Amin, Sayed Mahmud
    Islam, Md. Saiful
    Marium-E-Jannat
    2018 4TH INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING AND INFORMATION & COMMUNICATION TECHNOLOGY (ICEEICT), 2018, : 422 - 425
  • [32] Two-Stage Prosody Prediction for Emotional Text-to-Speech Synthesis
    Tang, Hao
    Zhou, Xi
    Odisio, Matthias
    Hasegawa-Johnson, Mark
    Huang, Thomas S.
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 2138 - 2141
  • [33] An Improved Syllabification for a Better Malay Language Text-to-Speech Synthesis (TTS)
    Ramlia, Izzad
    Jamil, Nursuriati
    Seman, Noraini
    Ardi, Norizah
    2015 IEEE INTERNATIONAL SYMPOSIUM ON ROBOTICS AND INTELLIGENT SENSORS (IEEE IRIS2015), 2015, 76 : 417 - 424
  • [34] KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
    Mussakhojayeva, Saida
    Janaliyeva, Aigerim
    Mirzakhmetov, Almas
    Khassanov, Yerbolat
    Varol, Huseyin Atakan
    INTERSPEECH 2021, 2021, : 2786 - 2790
  • [35] TTS-SA (A Text-to-Speech System based on Standard Arabic)
    Hanane, Tebbi
    Maamar, Hamadouche
    Hamid, Azzoune
    2014 FOURTH INTERNATIONAL CONFERENCE ON DIGITAL INFORMATION AND COMMUNICATION TECHNOLOGY AND IT'S APPLICATIONS (DICTAP), 2014, : 337 - 341
  • [36] Humanoid Audio-Visual Avatar With Emotive Text-to-Speech Synthesis
    Tang, Hao
    Fu, Yun
    Tu, Jilin
    Hasegawa-Johnson, Mark
    Huang, Thomas S.
    IEEE TRANSACTIONS ON MULTIMEDIA, 2008, 10 (06) : 969 - 981
  • [37] A Novel Quasi-Diphone Inventory Approach to Text-To-Speech Synthesis
    Gerazov, Branislav
    Shutinoski, Goce
    Arsov, Goce
    2008 IEEE MEDITERRANEAN ELECTROTECHNICAL CONFERENCE, VOLS 1 AND 2, 2008, : 778 - 783
  • [38] An Advanced NLP Framework for High-Quality Text-to-Speech Synthesis
    Ungurean, Catalin
    Burileanu, Dragos
    2011 6TH CONFERENCE ON SPEECH TECHNOLOGY AND HUMAN-COMPUTER DIALOGUE (SPED), 2011,
  • [39] A Tool to Solve Sentence Segmentation Problem on Preparing Speech Database for Indonesian Text-to-speech System
    Uliniansyah, Mohammad Teduh
    Gunarso
    Nurfadhilah, Elvira
    Aini, Lyla Ruslana
    Junde, Juliati
    Ayuningtyas, Fara
    Santosa, Agung
    SLTU-2016 5TH WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGIES FOR UNDER-RESOURCED LANGUAGES, 2016, 81 : 188 - 193
  • [40] Adjusting Pleasure-Arousal-Dominance for Continuous Emotional Text-to-speech Synthesizer
    Rabiee, Azam
    Kim, Tae-Ho
    Lee, Soo-Young
    INTERSPEECH 2019, 2019, : 3693 - 3694