Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History

Cited by: 6
Authors
Nishimura, Yuto [1 ]
Saito, Yuki [1 ]
Takamichi, Shinnosuke [1 ]
Tachibana, Kentaro [2 ]
Saruwatari, Hiroshi [1 ]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] LINE Corp, Tokyo, Japan
Source
INTERSPEECH 2022 | 2022
Keywords
speech synthesis; spoken dialogue; dialogue speech synthesis; empathy; contexts;
DOI
10.21437/Interspeech.2022-403
CLC Classification
O42 [Acoustics];
Discipline Codes
070206 ; 082403 ;
Abstract
We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that considers both the linguistic and prosodic contexts of the dialogue history. Empathy is the active attempt by humans to get inside the interlocutor in dialogue, and empathetic DSS is a technology for implementing this act in spoken dialogue systems. Our model is conditioned on the history of linguistic and prosodic features to predict the appropriate dialogue context; as such, it can be regarded as an extension of conventional linguistic-feature-based dialogue history modeling. To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained on large speech corpora, 2) style-guided training, in which a prosody embedding of the current utterance is predicted from the dialogue context embedding, 3) cross-modal attention to combine the text and speech modalities, and 4) sentence-wise embeddings to achieve finer-grained prosody modeling than utterance-wise modeling. The evaluation results demonstrate that 1) simply considering the prosodic contexts of the dialogue history does not improve speech quality in empathetic DSS, and 2) introducing style-guided training and sentence-wise embedding modeling achieves higher speech quality than the conventional method.
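The cross-modal attention in point 3) can be illustrated with a minimal sketch. This is not the paper's implementation; the function and variable names are hypothetical, and it assumes sentence-wise linguistic and prosody embeddings of equal dimensionality, fused by letting the text embeddings (queries) attend over the prosody embeddings (keys and values) of the dialogue history:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_emb, prosody_emb):
    """Scaled dot-product attention: text embeddings query the
    prosody embeddings of the dialogue history; returns a fused
    context with the same shape as text_emb."""
    d = text_emb.shape[-1]
    scores = text_emb @ prosody_emb.T / np.sqrt(d)   # (N_text, N_speech)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ prosody_emb                     # (N_text, d)

rng = np.random.default_rng(0)
text_hist = rng.normal(size=(4, 16))     # 4 history sentences, 16-dim linguistic embeddings
prosody_hist = rng.normal(size=(4, 16))  # matching sentence-wise prosody embeddings
context = cross_modal_attention(text_hist, prosody_hist)
print(context.shape)  # (4, 16)
```

In the sentence-wise setting of point 4), each row would correspond to one sentence of the history rather than one whole utterance, which is what allows the finer-grained prosody modeling the abstract describes.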
Pages: 3373-3377
Page count: 5