Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS

Cited by: 1
Authors
Kalyan, T. Pavan [1 ]
Rao, Preeti [1 ]
Jyothi, Preethi [1 ]
Bhattacharyya, Pushpak [1 ]
Affiliations
[1] Indian Inst Technol, Mumbai, Maharashtra, India
Source
INTERSPEECH 2023 | 2023
Keywords
Expressive TTS; speech synthesis; new TTS corpus; prosody modelling;
DOI
10.21437/Interspeech.2023-2469
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Current Text-to-Speech (TTS) systems are trained on audiobook data and perform well in synthesizing read-style speech. In this work, we are interested in synthesizing audio stories as narrated to children. The storytelling style is more expressive and requires perceptible changes of voice across the narrator and story characters. To address these challenges, we present a new TTS corpus of English audio stories for children with 32.7 hours of speech by a single female speaker with a UK accent. We provide evidence of the salient differences in the suprasegmentals of the narrator and character utterances in the dataset, motivating the use of a multi-speaker TTS for our application. We use a fine-tuned BERT model to label each sentence as being spoken by the narrator or a character; this label is then used to condition the TTS output. Experiments show that our new TTS system is superior in expressiveness to both reading-style TTS and single-speaker TTS in A-B preference and MOS testing.
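The conditioning pipeline described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration: the paper uses a fine-tuned BERT sentence classifier, which is replaced here by a toy quote-detection stand-in, and the label-to-speaker-ID mapping is an assumed two-speaker setup, not the authors' implementation.

```python
# Hypothetical sketch: label each story sentence as narrator or character,
# then use the label to pick the speaker ID that conditions a multi-speaker TTS.
# The real system uses a fine-tuned BERT classifier; this stand-in simply
# treats quoted dialogue as character speech.

NARRATOR, CHARACTER = "narrator", "character"

def label_sentence(sentence: str) -> str:
    """Toy stand-in for the paper's fine-tuned BERT sentence classifier."""
    return CHARACTER if '"' in sentence else NARRATOR

def speaker_ids(sentences):
    """Map each sentence's label to the speaker ID conditioning the TTS."""
    id_for = {NARRATOR: 0, CHARACTER: 1}  # assumed two-speaker layout
    return [id_for[label_sentence(s)] for s in sentences]

story = [
    'The fox crept closer to the henhouse.',
    '"Who goes there?" clucked the hen.',
]
print(speaker_ids(story))  # [0, 1]
```

In the actual system, each speaker ID would select a distinct speaker embedding in the multi-speaker TTS, producing the perceptible narrator/character voice change the paper targets.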
Pages: 4808 - 4812 (5 pages)
Related Papers
44 items in total
  • [41] Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network
    Shankar, Ravi
    Hsieh, Hsi-Wei
    Charon, Nicolas
    Venkataraman, Archana
    INTERSPEECH 2020, 2020, : 3391 - 3395
  • [42] LNACont: Language-normalized Affine Coupling Layer with contrastive learning for Cross-lingual Multi-speaker Text-to-speech
    Hwang, Sungwoong
    Kim, Changhwan
    32ND EUROPEAN SIGNAL PROCESSING CONFERENCE, EUSIPCO 2024, 2024, : 391 - 395
  • [43] Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder
    Zhao, Yi
    Takaki, Shinji
    Luong, Hieu-Thi
    Yamagishi, Junichi
    Saito, Daisuke
    Minematsu, Nobuaki
    IEEE ACCESS, 2018, 6 : 60478 - 60488
  • [44] U-Style: Cascading U-Nets With Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning
    Li, Tao
    Wang, Zhichao
    Zhu, Xinfa
    Cong, Jian
    Tian, Qiao
    Wang, Yuping
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 4026 - 4035