CONVERSATIONAL END-TO-END TTS FOR VOICE AGENTS

Cited by: 31
Authors
Guo, Haohan [1 ,3 ]
Zhang, Shaofei [2 ]
Soong, Frank K. [2 ]
He, Lei [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytechnical University, School of Computer Science, Audio, Speech and Language Processing Group (ASLP@NPU), Xi'an, China
[2] Microsoft China, Beijing, China
[3] Microsoft, Redmond, WA, USA
Source
2021 IEEE Spoken Language Technology Workshop (SLT) | 2021
Keywords
Text-to-Speech; End-to-End; Conversational TTS; Speech Corpus; Voice Agent
DOI
10.1109/SLT48900.2021.9383460
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
End-to-end neural TTS has achieved excellent performance on reading-style speech synthesis. However, building a high-quality conversational TTS remains a challenge due to the limitations of available corpora and of modeling capability. This study aims at building a conversational TTS for a voice agent under the sequence-to-sequence modeling framework. We first construct a spontaneous conversational speech corpus well designed for the voice agent, with a new recording scheme that ensures both recording quality and a conversational speaking style. Second, we propose a conversation-context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to specifically reinforce the information about the current utterance as well as its context in the conversation. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both the utterance level and the conversation level. Moreover, we find that the model can express some spontaneous behaviors, such as fillers and repeated words, which makes the conversational speaking style more realistic.
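The abstract only names the two added components (an auxiliary encoder for the current utterance and a conversational context encoder for the preceding turns); it does not specify layer sizes or how the resulting embeddings are fused with the seq2seq TTS model. The PyTorch sketch below is therefore a hedged illustration of the general idea rather than the paper's implementation: sentence embeddings of past turns (e.g. from a BERT-style model) are summarized by a GRU, an embedding of the current utterance is projected in parallel, and both vectors are broadcast over the text-encoder outputs that a Tacotron-like attention decoder would consume. All dimensions, the module names ConversationalContextEncoder and ContextAwareTTS, and the concatenation-based fusion are assumptions.

import torch
import torch.nn as nn

class ConversationalContextEncoder(nn.Module):
    """Summarizes sentence embeddings of previous turns into one context vector."""
    def __init__(self, sent_dim=768, ctx_dim=256):
        super().__init__()
        self.gru = nn.GRU(sent_dim, ctx_dim, batch_first=True)

    def forward(self, context_sent_embs):
        # context_sent_embs: (batch, n_turns, sent_dim), e.g. BERT [CLS] vectors
        _, h = self.gru(context_sent_embs)
        return h.squeeze(0)  # (batch, ctx_dim)

class ContextAwareTTS(nn.Module):
    """Tacotron-like text encoder whose outputs are augmented with current-utterance
    and conversation-context embeddings; the attention decoder is omitted."""
    def __init__(self, n_phones=100, enc_dim=512, sent_dim=768, ctx_dim=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, enc_dim)
        self.text_encoder = nn.LSTM(enc_dim, enc_dim // 2,
                                    batch_first=True, bidirectional=True)
        self.aux_proj = nn.Linear(sent_dim, ctx_dim)  # current-utterance branch
        self.context_encoder = ConversationalContextEncoder(sent_dim, ctx_dim)

    def forward(self, phone_ids, cur_sent_emb, context_sent_embs):
        memory, _ = self.text_encoder(self.phone_emb(phone_ids))  # (B, T, enc_dim)
        aux = self.aux_proj(cur_sent_emb)                 # (B, ctx_dim)
        ctx = self.context_encoder(context_sent_embs)     # (B, ctx_dim)
        cond = torch.cat([aux, ctx], dim=-1)              # (B, 2*ctx_dim)
        # Broadcast the conditioning vector over every encoder timestep.
        cond = cond.unsqueeze(1).expand(-1, memory.size(1), -1)
        return torch.cat([memory, cond], dim=-1)  # attention memory for a decoder

# Usage (hypothetical shapes): two utterances of 30 phones, five context turns.
phones = torch.randint(0, 100, (2, 30))
cur = torch.randn(2, 768)
ctx = torch.randn(2, 5, 768)
memory = ContextAwareTTS()(phones, cur, ctx)  # -> (2, 30, 1024)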
Pages: 403-409 (7 pages)