Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History

Cited by: 6
Authors
Nishimura, Yuto [1 ]
Saito, Yuki [1 ]
Takamichi, Shinnosuke [1 ]
Tachibana, Kentaro [2 ]
Saruwatari, Hiroshi [1 ]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] LINE Corp, Tokyo, Japan
Source
INTERSPEECH 2022 | 2022
Keywords
speech synthesis; spoken dialogue; dialogue speech synthesis; empathy; contexts;
DOI
10.21437/Interspeech.2022-403
CLC Classification
O42 [Acoustics];
Discipline Codes
070206 ; 082403 ;
Abstract
We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that considers both the linguistic and prosodic contexts of the dialogue history. Empathy is the active attempt by humans to get inside the interlocutor in dialogue, and empathetic DSS is a technology for implementing this act in spoken dialogue systems. Our model is conditioned on the history of linguistic and prosodic features to predict the appropriate dialogue context; as such, it can be regarded as an extension of conventional linguistic-feature-based dialogue history modeling. To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained on large speech corpora, 2) style-guided training, in which a prosody embedding of the current utterance is predicted from the dialogue context embedding, 3) cross-modal attention to combine the text and speech modalities, and 4) sentence-wise embeddings to achieve finer-grained prosody modeling than utterance-wise modeling. The evaluation results demonstrate that 1) simply considering the prosodic contexts of the dialogue history does not improve speech quality in empathetic DSS, and 2) introducing style-guided training and sentence-wise embedding modeling achieves higher speech quality than the conventional method.
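The cross-modal attention in point 3) can be illustrated with a minimal sketch. This is not the paper's implementation; the function and variable names are hypothetical, and it assumes sentence-wise linguistic and prosody embeddings of equal dimensionality, fused by letting the text embeddings (queries) attend over the prosody embeddings (keys and values) of the dialogue history:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_emb, prosody_emb):
    """Scaled dot-product attention: text embeddings query the
    prosody embeddings of the dialogue history; returns a fused
    context with the same shape as text_emb."""
    d = text_emb.shape[-1]
    scores = text_emb @ prosody_emb.T / np.sqrt(d)   # (N_text, N_speech)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ prosody_emb                     # (N_text, d)

rng = np.random.default_rng(0)
text_hist = rng.normal(size=(4, 16))     # 4 history sentences, 16-dim linguistic embeddings
prosody_hist = rng.normal(size=(4, 16))  # matching sentence-wise prosody embeddings
context = cross_modal_attention(text_hist, prosody_hist)
print(context.shape)  # (4, 16)
```

In the sentence-wise setting of point 4), each row would correspond to one sentence of the history rather than one whole utterance, which is what allows the finer-grained prosody modeling the abstract describes.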
Pages: 3373-3377
Page count: 5