UNSUPERVISED WORD-LEVEL PROSODY TAGGING FOR CONTROLLABLE SPEECH SYNTHESIS

Cited by: 6
Authors
Guo, Yiwei [1]
Du, Chenpeng [1]
Yu, Kai [1]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, X-LANCE Lab, Shanghai, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Prosody control; prosody tagging; word-level prosody; speech synthesis
DOI
10.1109/ICASSP43922.2022.9746323
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel two-stage approach for unsupervised word-level prosody tagging: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies within each type of words separately using a Gaussian mixture model (GMM). This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system is trained with the derived word-level prosody tags for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ability to manipulate word-level prosody.
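The two-stage idea in the abstract (group words by phonetic content, then cluster prosody within each group) can be illustrated with a small toy sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: the decision tree is a hand-written stand-in that asks only about phoneme count, the thresholds and type names are hypothetical, each word's prosody is reduced to a single scalar, and the GMM is a plain one-dimensional EM fit written with the Python standard library.

```python
import math
import random
from collections import defaultdict


def word_type(phonemes):
    """Stage 1 stand-in: a tiny fixed decision tree over phonetic content.

    The questions (phoneme-count thresholds) are hypothetical, not from the paper.
    """
    if len(phonemes) <= 3:
        return "short"
    if len(phonemes) <= 6:
        return "medium"
    return "long"


def gauss(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, k, iters=50):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    n = len(xs)
    xs_sorted = sorted(xs)
    # Deterministic init: means at evenly spaced quantiles, shared variance.
    means = [xs_sorted[int((j + 0.5) * n / k)] for j in range(k)]
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n + 1e-6
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = (
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
            )
    return weights, means, variances


def predict(x, gmm):
    """Index of the mixture component with the highest weighted density at x."""
    weights, means, variances = gmm
    scores = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)


def tag_words(words, k=3):
    """Stage 2: fit one GMM per word type, then tag each word as (type, cluster)."""
    by_type = defaultdict(list)
    for phonemes, value in words:
        by_type[word_type(phonemes)].append(value)
    gmms = {t: fit_gmm_1d(vals, k) for t, vals in by_type.items()}
    return [(word_type(ph), predict(v, gmms[word_type(ph)])) for ph, v in words]


# Synthetic toy data: placeholder phoneme lists and scalar "prosody" values.
rng = random.Random(0)
toy_words = []
for _ in range(300):
    phonemes = ["AH"] * rng.randint(1, 9)
    value = rng.gauss(rng.choice([-2.0, 0.0, 2.0]), 0.3)
    toy_words.append((phonemes, value))

tags = tag_words(toy_words)
```

Because each word type gets its own mixture, the cluster index is only meaningful together with the type: tag `("short", 1)` and tag `("long", 1)` belong to different label sets, which mirrors the assumption stated in the abstract.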
Pages: 7597-7601
Page count: 5