Word-level Text Markup for Prosody Control in Speech Synthesis

Times cited: 0
Authors
Korotkova, Yuliya [1,2]
Kalinovskiy, Ilya [1,3]
Vakhrusheva, Tatiana [1,2]
Affiliations
[1] JustAI, St Petersburg, Russia
[2] Higher Sch Econ, Moscow, Russia
[3] Tomsk Polytech Univ, Sch Comp Sci & Robot, Tomsk, Russia
Source
INTERSPEECH 2024 | 2024
Keywords
prosody control; prosody tagging; word-level prosody; speech synthesis; TTS;
DOI
10.21437/Interspeech.2024-715
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Modern Text-to-Speech (TTS) technologies generate speech that is very close to natural speech, but synthesized voices still lack variation in intonation, which is also hard to control. In this work, we address the problem of prosody control, aiming to capture information about intonation in a markup without hand-labeling or linguistic expertise. We propose a method for encoding prosodic knowledge from the textual and acoustic modalities, obtained with models pretrained on self-supervised tasks, into a latent quantized space with interpretable features. The prosodic markup is constructed from these features; it is used as an additional input to the TTS model to address the one-to-many problem and is predicted from text. Moreover, the method allows prosody control at inference time and scales to new data and other languages.
Pages: 2280-2284
Page count: 5