H4C-TTS: Leveraging Multi-Modal Historical Context for Conversational Text-to-Speech

Cited by: 0
Authors
Seong, Donghyun [1 ]
Chang, Joon-Hyuk [1 ]
Affiliations
[1] Hanyang Univ, Dept Elect Engn, Seoul, South Korea
Source
INTERSPEECH 2024 | 2024
Funding
National Research Foundation, Singapore
Keywords
Text-to-speech; conversational speech synthesis; multi-modal
DOI
10.21437/Interspeech.2024-1480
Abstract
Conversational text-to-speech (TTS) aims to synthesize voices appropriate to a situation by considering the context of past conversations as well as the current text. However, analyzing and modeling conversational context remains challenging. Most conversational TTS systems use the content of historical and recent conversations without distinguishing between them, and often generate speech that does not fit the situation. Hence, we introduce a novel conversational TTS, H4C-TTS, that leverages multi-modal historical context to achieve contextually appropriate, natural speech synthesis. To facilitate conversational context modeling, we design a context encoder that incorporates historical and recent contexts, and a multi-modal encoder that processes textual and acoustic inputs. Experimental results demonstrate that the proposed model significantly improves the naturalness and quality of speech in conversational contexts compared with existing conversational TTS systems.
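The abstract describes two components: a context encoder that treats historical and recent conversation turns separately, and a multi-modal encoder that fuses textual and acoustic inputs per turn. A minimal, illustrative sketch of that idea follows; the toy dimensions, the linear fusion, and the mixing weight `alpha` are assumptions for illustration, not details from the paper:

```python
import numpy as np

D_TEXT, D_AUDIO, D_CTX = 8, 6, 4  # toy embedding sizes, not from the paper

rng = np.random.default_rng(0)
W_TEXT = rng.standard_normal((D_TEXT, D_CTX))
W_AUDIO = rng.standard_normal((D_AUDIO, D_CTX))

def encode_turn(text_emb, audio_emb):
    # Hypothetical multi-modal fusion: project each modality into a shared
    # space and combine (the paper's actual encoder is not specified here).
    return np.tanh(text_emb @ W_TEXT + audio_emb @ W_AUDIO)

def conversation_context(turn_embs, n_recent=1, alpha=0.3):
    # Distinguish "historical" from "recent" turns and weight them
    # separately; alpha is an assumed mixing weight.
    hist, recent = turn_embs[:-n_recent], turn_embs[-n_recent:]
    rec = np.mean(recent, axis=0)
    if len(hist) == 0:
        return rec
    return alpha * np.mean(hist, axis=0) + (1 - alpha) * rec

# Example: three past turns, each with a text and an audio embedding
turns = [encode_turn(rng.standard_normal(D_TEXT), rng.standard_normal(D_AUDIO))
         for _ in range(3)]
ctx = conversation_context(turns)
print(ctx.shape)  # (4,)
```

The resulting context vector would condition the TTS decoder; keeping separate weights for historical and recent turns is what lets the model treat them differently, which the abstract identifies as the gap in prior conversational TTS.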
Pages: 4933-4937
Page count: 5