ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings

Citations: 3
Authors
Saito, Yuki [1]
Takamichi, Shinnosuke [1]
Iimori, Eiji [1]
Tachibana, Kentaro [2]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] LINE Corp, Tokyo, Japan
Source
INTERSPEECH 2023 | 2023
Keywords
text-to-speech; empathetic dialogue speech synthesis; dialogue context; ChatGPT; prompt engineering
DOI
10.21437/Interspeech.2023-1095
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We propose ChatGPT-EDSS, an empathetic dialogue speech synthesis (EDSS) method using ChatGPT for extracting dialogue context. ChatGPT is a chatbot that can deeply understand the content and purpose of an input prompt and appropriately respond to the user's request. We focus on ChatGPT's reading comprehension and introduce it to EDSS, a task of synthesizing speech that can empathize with the interlocutor's emotion. Our method first gives chat history to ChatGPT and asks it to generate three words representing the intention, emotion, and speaking style for each line in the chat. Then, it trains an EDSS model using the embeddings of ChatGPT-derived context words as the conditioning features. The experimental results demonstrate that our method performs comparably to ones using emotion labels or neural network-derived context embeddings learned from chat histories. The collected ChatGPT-derived context information is available at our project page.
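The two steps described in the abstract (prompting for three context words per chat line, then turning those words into conditioning features) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the prompt wording, the "<number>. <intention>, <emotion>, <style>" response format, the helper names, and the hash-based `toy_embedding`, which merely stands in for whatever word-embedding model a real system would use. No call to ChatGPT is made; the response string is supplied by hand.

```python
import hashlib
import re

# The three aspects named in the abstract.
ASPECTS = ("intention", "emotion", "speaking_style")


def build_prompt(chat_history):
    """Format (speaker, text) pairs into a prompt (hypothetical wording)."""
    numbered = "\n".join(
        f"{i}. {speaker}: {text}"
        for i, (speaker, text) in enumerate(chat_history, start=1)
    )
    return (
        "For each numbered line of the chat below, give three words describing "
        "the speaker's intention, emotion, and speaking style.\n"
        "Answer one line per utterance as: "
        "<number>. <intention>, <emotion>, <style>\n\n" + numbered
    )


def parse_context_words(response, num_lines):
    """Parse '<n>. w1, w2, w3' lines; unparsed utterances stay None."""
    result = [None] * num_lines
    pattern = re.compile(r"\s*(\d+)\.\s*([^,]+),\s*([^,]+),\s*([^,]+?)\s*$")
    for line in response.splitlines():
        match = pattern.match(line)
        if not match:
            continue
        index = int(match.group(1)) - 1
        if 0 <= index < num_lines:
            words = (w.strip().lower() for w in match.groups()[1:])
            result[index] = dict(zip(ASPECTS, words))
    return result


def toy_embedding(word, dim=4):
    """Deterministic placeholder vector, NOT a real word embedding."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def conditioning_vector(context, dim=4):
    """Concatenate the three context-word vectors into one feature vector."""
    vector = []
    for aspect in ASPECTS:
        vector.extend(toy_embedding(context[aspect], dim))
    return vector


chat = [("A", "I failed my exam."), ("B", "Oh no, I'm sorry to hear that.")]
prompt = build_prompt(chat)
# Hand-written stand-in for a chatbot reply in the assumed format.
reply = "1. confess, sad, quiet\n2. comfort, sympathetic, gentle"
contexts = parse_context_words(reply, len(chat))
features = [conditioning_vector(c) for c in contexts if c is not None]
```

In this sketch, each utterance yields one fixed-length vector that a speech synthesis model could take as an additional conditioning input; how the actual method embeds and injects the context words is not specified in this record.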
Pages: 3048-3052
Page count: 5