UNSUPERVISED WORD-LEVEL PROSODY TAGGING FOR CONTROLLABLE SPEECH SYNTHESIS

Cited by: 4
Authors
Guo, Yiwei [1]
Du, Chenpeng [1]
Yu, Kai [1]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, X LANCE Lab, Shanghai, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Prosody control; prosody tagging; word-level prosody; speech synthesis
DOI
10.1109/ICASSP43922.2022.9746323
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel two-stage approach for unsupervised word-level prosody tagging: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies within each type of words separately using a Gaussian mixture model (GMM). This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ability to manipulate word-level prosody.
Pages: 7597-7601
Page count: 5
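
The two-stage tagging idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that words are grouped by a decision tree over phonetic content and that prosodies are clustered with a GMM within each group. Here the decision tree is replaced by a hypothetical one-rule split on phoneme count (`word_type`), the prosody representation is assumed to be a single scalar per word, and the toy data, function names, and the choice of two clusters per type are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def word_type(phonemes):
    """Hypothetical stand-in for the decision tree: split words by phoneme count."""
    return "short" if len(phonemes) <= 3 else "long"


def gauss(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, k, iters=50):
    """Fit a k-component 1-D GMM with EM; returns (weights, means, variances)."""
    n = len(xs)
    xs_sorted = sorted(xs)
    # Deterministic initialisation: means at evenly spaced quantiles.
    means = [xs_sorted[int((i + 0.5) * n / k)] for i in range(k)]
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n + 1e-6
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = (
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
            )
    return weights, means, variances


def tag_words(words, k=2):
    """words: list of (text, phonemes, prosody_value).

    Returns one (word_type, cluster_id) tag per word, so each word type
    has its own label set.
    """
    groups = defaultdict(list)
    for idx, (_, phonemes, _) in enumerate(words):
        groups[word_type(phonemes)].append(idx)
    tags = [None] * len(words)
    for wtype, idxs in groups.items():
        xs = [words[i][2] for i in idxs]
        weights, means, variances = fit_gmm_1d(xs, k)
        for i, x in zip(idxs, xs):
            post = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            tags[i] = (wtype, max(range(k), key=post.__getitem__))
    return tags


# Toy data (invented for illustration): text, phonemes, scalar prosody value.
words = [
    ("a", ["AH"], 0.10),
    ("the", ["DH", "AH"], 0.20),
    ("to", ["T", "UW"], 0.15),
    ("it", ["IH", "T"], 2.00),
    ("of", ["AH", "V"], 2.10),
    ("in", ["IH", "N"], 1.90),
    ("synthesis", ["S", "IH", "N", "TH", "AH", "S", "AH", "S"], -1.00),
    ("prosody", ["P", "R", "AA", "S", "AH", "D", "IY"], -1.10),
    ("control", ["K", "AH", "N", "T", "R", "OW", "L"], 1.00),
    ("tagging", ["T", "AE", "G", "IH", "NG"], 1.20),
]
tags = tag_words(words)
for (text, _, _), tag in zip(words, tags):
    print(text, tag)
```

Because clustering runs separately inside each word type, a tag such as `("short", 1)` and a tag such as `("long", 1)` are unrelated labels, which mirrors the abstract's assumption that different types of words need different label sets. In a fuller setup the scalar value would presumably be replaced by a learned prosody representation and the rule by a fitted decision tree.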