Narrator or Character: Voice Modulation in an Expressive Multi-speaker TTS

Cited by: 1

Authors:

Kalyan, T. Pavan [1]

Rao, Preeti [1]

Jyothi, Preethi [1]

Bhattacharyya, Pushpak [1]

Affiliations:

[1] Indian Inst Technol, Mumbai, Maharashtra, India

Source:

INTERSPEECH 2023 | 2023

Keywords:

Expressive TTS; speech synthesis; new TTS corpus; prosody modelling

DOI:

10.21437/Interspeech.2023-2469

Chinese Library Classification (CLC):

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Current Text-to-Speech (TTS) systems are trained on audiobook data and perform well in synthesizing read-style speech. In this work, we are interested in synthesizing audio stories as narrated to children. The storytelling style is more expressive and requires perceptible changes of voice across the narrator and story characters. To address these challenges, we present a new TTS corpus of English audio stories for children with 32.7 hours of speech by a single female speaker with a UK accent. We provide evidence of the salient differences in the suprasegmentals of the narrator and character utterances in the dataset, motivating the use of a multi-speaker TTS for our application. We use a fine-tuned BERT model to label each sentence as being spoken by a narrator or character that is subsequently used to condition the TTS output. Experiments show our new TTS system is superior in expressiveness in both A-B preference and MOS testing compared to reading-style TTS and single-speaker TTS.

Pages: 4808-4812

Page count: 5
