PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Cited by: 10
Authors
Karlapati, Sri [1 ]
Abbas, Ammar [1 ]
Hodari, Zack [2 ]
Moinet, Alexis [1 ]
Joly, Arnaud [1 ]
Karanasou, Penny [1 ]
Drugman, Thomas [1 ]
Affiliations
[1] Amazon Res, Cambridge, England
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
TTS; prosody modelling; contextual prosody;
DOI
10.1109/ICASSP39728.2021.9413696
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
Pages: 6573 - 6577
Number of pages: 5
Related papers
50 in total
  • [21] A new Korean corpus-based text-to-speech system
    Kim S.
    Lee Y.
    Hirose K.
    International Journal of Speech Technology, 2002, 5 (02) : 105 - 116
  • [22] Combining Text-to-Speech Services with Conventional Voiceover for News Oralization
    Afonso, Marcelo
    Almeida, Pedro
    APPLICATIONS AND USABILITY OF INTERACTIVE TV, JAUTI 2022, 2023, 1820 : 68 - 79
  • [23] Learning prosodic patterns for Mandarin speech synthesis
    Chen, YQ
    Gao, W
    Zhu, TS
    Ling, C
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2002, 19 (01) : 95 - 109
  • [24] Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration
    Yeshpanov, Rustem
    Mussakhojayeva, Saida
    Khassanov, Yerbolat
    INTERSPEECH 2023, 2023, : 5521 - 5525
  • [25] Learning Prosodic Patterns for Mandarin Speech Synthesis
    Yiqiang Chen
    Wen Gao
    Tingshao Zhu
    Charles Ling
    Journal of Intelligent Information Systems, 2002, 19 : 95 - 109
  • [26] Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features
    Lux, Florian
    Vu, Ngoc Thang
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 6858 - 6868
  • [27] Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech
    Huang, Sung-Feng
    Lin, Chyi-Jiunn
    Liu, Da-Rong
    Chen, Yi-Chen
    Lee, Hung-yi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1558 - 1571
  • [28] Algorithms for Speech Segmentation at Syllable-Level for Text-to-Speech Synthesis System in Gujarati
    Patil, Hemant A.
    Patel, Tanvina
    Talesara, Swati
    Shah, Nirmesh
    Sailor, Hardik
    Vachhani, Bhavik
    Akhani, Janki
    Kanakiya, Bhargav
    Gaur, Yashesh
    Prajapati, Vibha
    2013 INTERNATIONAL CONFERENCE ORIENTAL COCOSDA HELD JOINTLY WITH 2013 CONFERENCE ON ASIAN SPOKEN LANGUAGE RESEARCH AND EVALUATION (O-COCOSDA/CASLRE), 2013,
  • [29] Investigations on speaker adaptation using a continuous vocoder within recurrent neural network based text-to-speech synthesis
    Ali Raheem Mandeel
    Mohammed Salah Al-Radhi
    Tamás Gábor Csapó
    Multimedia Tools and Applications, 2023, 82 : 15635 - 15649
  • [30] Investigations on speaker adaptation using a continuous vocoder within recurrent neural network based text-to-speech synthesis
    Mandeel, Ali Raheem
    Al-Radhi, Mohammed Salah
    Csapo, Tamas Gabor
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (10) : 15635 - 15649