PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Cited by: 10
Authors
Karlapati, Sri [1]
Abbas, Ammar [1]
Hodari, Zack [2]
Moinet, Alexis [1]
Joly, Arnaud [1]
Karanasou, Penny [1]
Drugman, Thomas [1]
Affiliations
[1] Amazon Res, Cambridge, England
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
TTS; prosody modelling; contextual prosody
DOI
10.1109/ICASSP39728.2021.9413696
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
Pages: 6573-6577
Page count: 5
Related papers
50 in total
  • [41] Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
    Zhang, Guangyan
    Merritt, Thomas
    Ribeiro, Manuel Sam
    Tura-Vecino, Biel
    Yanagisawa, Kayoko
    Pokora, Kamil
    Ezzerg, Abdelhamid
    Cygert, Sebastian
    Abbas, Ammar
    Bilinski, Piotr
    Barra-Chicote, Roberto
    Korzekwa, Daniel
    Lorenzo-Trueba, Jaime
    INTERSPEECH 2023, 2023, : 27 - 31
  • [42] A Preliminary Study on Wav2Vec 2.0 Embeddings for Text-to-Speech
    Lim, Yohan
    Kim, Namhyeong
    Yun, Seung
    Kim, Hun
    Lee, Seung-Ik
    12TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE (ICTC 2021): BEYOND THE PANDEMIC ERA WITH ICT CONVERGENCE INNOVATION, 2021, : 343 - 347
  • [43] Spatial Speaker: 3D Java Text-to-Speech Converter
    Sodnik, Jaka
    Tomazic, Saso
    WCECS 2009: WORLD CONGRESS ON ENGINEERING AND COMPUTER SCIENCE, VOLS I AND II, 2009, : 1306 - 1310
  • [44] End-to-End Text-To-Speech synthesis for under resourced South African languages
    Nthite, Thapelo
    Tsoeu, Mohohlo
    2020 INTERNATIONAL SAUPEC/ROBMECH/PRASA CONFERENCE, 2020, : 684 - 689
  • [45] FCL-TACO2: TOWARDS FAST, CONTROLLABLE AND LIGHTWEIGHT TEXT-TO-SPEECH SYNTHESIS
    Wang, Disong
    Deng, Liqun
    Zhang, Yang
    Zheng, Nianzu
    Yeung, Yu Ting
    Chen, Xiao
    Liu, Xunying
    Meng, Helen
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5714 - 5718
  • [46] Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information
    Zhan, Haoyue
    Zhang, Haitong
    Ou, Wenjie
    Lin, Yue
    INTERSPEECH 2021, 2021, : 1599 - 1603
  • [47] Towards a Vowel Formant Based Quality Metric for Text-to-Speech Systems: Measuring Monophthong Naturalness
    Albrecht, Sven
    Tamboli, Rewa
    Taubert, Stefan
    Eibl, Maximilian
    Diaeresis, Gunter
    Schmied, Josef
    2022 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND VIRTUAL ENVIRONMENTS FOR MEASUREMENT SYSTEMS AND APPLICATIONS (IEEE CIVEMSA 2022), 2022,
  • [48] Natural Text-to-Speech Synthesis by Conditioning Spectrogram Predictions from Transformer Network on WaveGlow Vocoder
    Sanjay, G.
    Sooraj, K. C.
    Mishra, Deepak
    2020 7TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING & MACHINE INTELLIGENCE (ISCMI 2020), 2020, : 255 - 259
  • [49] Choice of Voices: A Large-Scale Evaluation of Text-to-Speech Voice Quality for Long-Form Content
    Cambre, Julia
    Colnago, Jessica
    Maddock, Jim
    Tsai, Janice
    Kaye, Jofish
    PROCEEDINGS OF THE 2020 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS (CHI'20), 2020,
  • [50] Cross-lingual Speaker Adaptation using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    INTERSPEECH 2021, 2021, : 1614 - 1618