PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH

Cited by: 10

Authors:

Karlapati, Sri [1]

Abbas, Ammar [1]

Hodari, Zack [2]

Moinet, Alexis [1]

Joly, Arnaud [1]

Karanasou, Penny [1]

Drugman, Thomas [1]

Affiliations:

[1] Amazon Res, Cambridge, England

[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

TTS; prosody modelling; contextual prosody

DOI:

10.1109/ICASSP39728.2021.9413696

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.

Pages: 6573 - 6577

Page count: 5
