Word-level Text Markup for Prosody Control in Speech Synthesis

Cited by: 0
Authors
Korotkova, Yuliya [1, 2]
Kalinovskiy, Ilya [1, 3]
Vakhrusheva, Tatiana [1, 2]
Affiliations
[1] JustAI, St Petersburg, Russia
[2] Higher Sch Econ, Moscow, Russia
[3] Tomsk Polytech Univ, Sch Comp Sci & Robot, Tomsk, Russia
Source
INTERSPEECH 2024 | 2024
Keywords
prosody control; prosody tagging; word-level prosody; speech synthesis; TTS
DOI
10.21437/Interspeech.2024-715
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Modern Text-to-Speech (TTS) technologies generate speech very close to the natural one, but synthesized voices still lack variation in intonation which, in addition, is hard to control. In this work, we address the problem of prosody control, aiming to capture information about intonation in a markup without hand-labeling and linguistic expertise. We propose a method of encoding prosodic knowledge from textual and acoustic modalities, which are obtained with the help of models pretrained on self-supervised tasks, into latent quantized space with interpretable features. Based on these features, the prosodic markup is constructed, and it is used as an additional input to the TTS model to solve the one-to-many problem and is predicted by text. Moreover, this method allows for prosody control during inference time and scalability to new data and other languages.
Pages: 2280-2284
Page count: 5