Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS

Cited by: 1
Authors
Kalyan, T. Pavan [1 ]
Rao, Preeti [1 ]
Jyothi, Preethi [1 ]
Bhattacharyya, Pushpak [1 ]
Affiliations
[1] Indian Inst Technol, Mumbai, Maharashtra, India
Source
INTERSPEECH 2023 | 2023
Keywords
Expressive TTS; speech synthesis; new TTS corpus; prosody modelling;
DOI
10.21437/Interspeech.2023-2469
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Current Text-to-Speech (TTS) systems are trained on audiobook data and perform well in synthesizing read-style speech. In this work, we are interested in synthesizing audio stories as narrated to children. The storytelling style is more expressive and requires perceptible changes of voice across the narrator and story characters. To address these challenges, we present a new TTS corpus of English audio stories for children with 32.7 hours of speech by a single female speaker with a UK accent. We provide evidence of the salient differences in the suprasegmentals of the narrator and character utterances in the dataset, motivating the use of a multi-speaker TTS for our application. We use a fine-tuned BERT model to label each sentence as being spoken by the narrator or a character; this label is then used to condition the TTS output. Experiments show that our new TTS system is superior in expressiveness to both reading-style TTS and single-speaker TTS in A-B preference and MOS testing.
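The conditioning pipeline described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration: the paper uses a fine-tuned BERT sentence classifier, which is replaced here by a toy quote-detection stand-in, and the label-to-speaker-ID mapping is an assumed two-speaker setup, not the authors' implementation.

```python
# Hypothetical sketch: label each story sentence as narrator or character,
# then use the label to pick the speaker ID that conditions a multi-speaker TTS.
# The real system uses a fine-tuned BERT classifier; this stand-in simply
# treats quoted dialogue as character speech.

NARRATOR, CHARACTER = "narrator", "character"

def label_sentence(sentence: str) -> str:
    """Toy stand-in for the paper's fine-tuned BERT sentence classifier."""
    return CHARACTER if '"' in sentence else NARRATOR

def speaker_ids(sentences):
    """Map each sentence's label to the speaker ID conditioning the TTS."""
    id_for = {NARRATOR: 0, CHARACTER: 1}  # assumed two-speaker layout
    return [id_for[label_sentence(s)] for s in sentences]

story = [
    'The fox crept closer to the henhouse.',
    '"Who goes there?" clucked the hen.',
]
print(speaker_ids(story))  # [0, 1]
```

In the actual system, each speaker ID would select a distinct speaker embedding in the multi-speaker TTS, producing the perceptible narrator/character voice change the paper targets.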
Pages: 4808 - 4812 (5 pages)
Related Papers
44 items in total
  • [41] Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network
    Shankar, Ravi
    Hsieh, Hsi-Wei
    Charon, Nicolas
    Venkataraman, Archana
    INTERSPEECH 2020, 2020, : 3391 - 3395
  • [42] LNACont: Language-normalized Affine Coupling Layer with contrastive learning for Cross-lingual Multi-speaker Text-to-speech
    Hwang, Sungwoong
    Kim, Changhwan
    32ND EUROPEAN SIGNAL PROCESSING CONFERENCE, EUSIPCO 2024, 2024, : 391 - 395
  • [43] Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder
    Zhao, Yi
    Takaki, Shinji
    Luong, Hieu-Thi
    Yamagishi, Junichi
    Saito, Daisuke
    Minematsu, Nobuaki
    IEEE ACCESS, 2018, 6 : 60478 - 60488
  • [44] U-Style: Cascading U-Nets With Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning
    Li, Tao
    Wang, Zhichao
    Zhu, Xinfa
    Cong, Jian
    Tian, Qiao
    Wang, Yuping
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 4026 - 4035