Word-level Text Markup for Prosody Control in Speech Synthesis

Cited by: 0

Authors:

Korotkova, Yuliya ^{[1,2]}

Kalinovskiy, Ilya ^{[1,3]}

Vakhrusheva, Tatiana ^{[1,2]}

Affiliations:

[1] JustAI, St Petersburg, Russia

[2] Higher School of Economics, Moscow, Russia

[3] Tomsk Polytechnic University, School of Computer Science & Robotics, Tomsk, Russia

Source:

INTERSPEECH 2024 | 2024

Keywords:

prosody control; prosody tagging; word-level prosody; speech synthesis; TTS

DOI:

10.21437/Interspeech.2024-715

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Modern text-to-speech (TTS) technologies generate speech that is very close to natural speech, but synthesized voices still lack variation in intonation, which is also hard to control. In this work, we address the problem of prosody control, aiming to capture information about intonation in a markup without hand-labeling or linguistic expertise. We propose a method of encoding prosodic knowledge from the textual and acoustic modalities, obtained with the help of models pretrained on self-supervised tasks, into a latent quantized space with interpretable features. The prosodic markup is constructed from these features; it is used as an additional input to the TTS model to solve the one-to-many problem and is predicted from text. Moreover, this method allows for prosody control at inference time and scales to new data and other languages.
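
The idea of turning continuous word-level features into a discrete markup can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the quantizer, the feature dimensions, or the tag syntax. The sketch assumes a plain nearest-neighbour lookup in a small fixed codebook, and the names `nearest_code`, `mark_up`, `CODEBOOK` and the `<pN>` tag format are hypothetical.

```python
from typing import List, Sequence


def nearest_code(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector`.

    Closeness is squared Euclidean distance; ties go to the lowest index.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def mark_up(words: List[str],
            features: Sequence[Sequence[float]],
            codebook: Sequence[Sequence[float]]) -> str:
    """Attach a hypothetical discrete prosody tag `<pN>` to every word.

    `features[i]` stands in for the prosodic feature vector of `words[i]`;
    how such vectors are obtained is outside the scope of this sketch.
    """
    if len(words) != len(features):
        raise ValueError("need exactly one feature vector per word")
    return " ".join(
        f"{word}<p{nearest_code(vec, codebook)}>"
        for word, vec in zip(words, features)
    )


# Toy 2-D codebook with three entries (illustrative values only).
CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]

if __name__ == "__main__":
    toy_features = [[0.1, -0.1], [4.8, 5.2], [0.9, 1.2]]
    print(mark_up(["so", "really", "good"], toy_features, CODEBOOK))
```

Because the tags are discrete, changing one by hand (for example, replacing one word's `<p0>` with `<p2>`) is the kind of inference-time control the abstract refers to, under the assumption that the synthesis model conditions on these tags.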


Pages: 2280-2284

Page count: 5
