Expressive Text-to-Speech using Style Tag

Cited by: 10
Authors
Kim, Minchan [1, 2]
Cheon, Sung Jun [1, 2]
Choi, Byoung Jin [1, 2]
Kim, Jong Jin [3]
Kim, Nam Soo [1, 2]
Affiliations
[1] Seoul National University, Department of Electrical and Computer Engineering, Seoul, South Korea
[2] Seoul National University, INMC, Seoul, South Korea
[3] SK Telecom, Seoul, South Korea
Source
INTERSPEECH 2021 | 2021
Keywords
speech synthesis; expressive TTS; language model; non-autoregressive TTS
DOI
10.21437/Interspeech.2021-465
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
As recent text-to-speech (TTS) systems have rapidly improved in speech quality and generation speed, many researchers now focus on a more challenging issue: expressive TTS. To control speaking style, existing expressive TTS models use a categorical style index or reference speech as the style input. In this work, we propose StyleTagging-TTS (ST-TTS), a novel expressive TTS model that uses a style tag written in natural language. Using a style-tagged TTS dataset and a pre-trained language model, we model the relationship between the linguistic embedding and the speaking style domain, which enables our model to work even with style tags unseen during training. Because the style tag is written in natural language, it can control speaking style in a more intuitive, interpretable, and scalable way than a style index or reference speech. In addition, in terms of model architecture, we propose an efficient non-autoregressive (NAR) TTS architecture with single-stage training. Experimental results show that ST-TTS outperforms the existing expressive TTS model, Tacotron2-GST, in speech quality and expressiveness.
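The intuition behind handling unseen style tags can be illustrated with a small sketch. This is not the ST-TTS architecture: the abstract does not give implementation details, so everything here is an assumption made for illustration. The table `TOY_LM` is a hypothetical hand-made stand-in for a pre-trained language model, and the helper names `embed_tag`, `cosine`, and `nearest_seen_tag` are invented for this example. It only shows that when tags live in a shared embedding space, a tag absent from the training set can still land close to a related tag that was seen.

```python
import math

# Hypothetical stand-in for a pre-trained language model: a hand-made table
# of 3-dimensional toy vectors. These values are invented for illustration
# and do not come from the paper or from any real model.
TOY_LM = {
    "happy": (0.9, 0.1, 0.0),
    "joyful": (0.8, 0.2, 0.1),
    "sad": (-0.9, 0.1, 0.0),
    "gloomy": (-0.8, 0.2, 0.1),
    "angry": (0.0, -0.9, 0.3),
}


def embed_tag(tag):
    """Look up the toy embedding of a style tag (case-insensitive)."""
    return TOY_LM[tag.lower()]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_seen_tag(tag, seen_tags):
    """Return the seen tag whose embedding is most similar to `tag`."""
    query = embed_tag(tag)
    return max(seen_tags, key=lambda t: cosine(query, embed_tag(t)))


# Tags assumed to appear in the (hypothetical) training set.
SEEN = ["happy", "sad", "angry"]

if __name__ == "__main__":
    for unseen in ("joyful", "gloomy"):
        print(unseen, "->", nearest_seen_tag(unseen, SEEN))
```

In this toy setup the unseen tags "joyful" and "gloomy" resolve to "happy" and "sad" respectively. A model as described in the abstract would presumably learn a continuous mapping from language-model embeddings to a style representation rather than pick a nearest neighbor; the lookup above is used only because it makes the shared-space idea easy to inspect.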
Pages: 4663-4667
Number of pages: 5