Controllable Context-aware Conversational Speech Synthesis

Cited by: 10

Authors:

Cong, Jian ^{[1,2]}

Yang, Shan ^{[2]}

Hu, Na ^{[2]}

Li, Guangzhi ^{[2]}

Xie, Lei ^{[1]}

Su, Dan ^{[2]}

Affiliations:

[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China

[2] Tencent AI Lab, Shenzhen, Peoples R China

Source:

INTERSPEECH 2021 | 2021

Keywords:

Speech synthesis; Spontaneous speech; Conversational speech;

DOI:

10.21437/Interspeech.2021-412

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification codes:

100104 ; 100213 ;

Abstract:

In spoken conversations, spontaneous behaviors such as filled pauses and prolongations occur frequently. Conversational partners also tend to align features of their speech with those of their interlocutor, a phenomenon known as entrainment. To produce human-like conversations, we propose a unified, controllable spontaneous conversational speech synthesis framework that models both phenomena. Specifically, we use explicit labels to represent two typical spontaneous behaviors, filled pause and prolongation, in the acoustic model, and develop a neural-network-based predictor to predict the occurrences of the two behaviors from text. We then develop an algorithm based on the predictor to control how often the behaviors occur, so that the synthesized speech can range from less disfluent to more disfluent. To model speech entrainment at the acoustic level, we use a context acoustic encoder to extract a global style embedding from the previous utterance, which conditions the synthesis of the current utterance. Furthermore, since the current and previous utterances belong to different speakers in a conversation, we add a domain adversarial training module to remove speaker-related information from the acoustic encoder while retaining style-related information. Experiments show that the proposed approach can synthesize realistic conversations and control the occurrences of spontaneous behaviors naturally.
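The abstract does not spell out the control algorithm, so the following is only an illustrative sketch of one plausible reading and should not be taken as the authors' method. It assumes a predictor that emits one probability per token (the `probs` list is made-up data) and a hypothetical `rate` parameter giving the fraction of tokens to label with a spontaneous behavior. Raising `rate` yields more disfluent output, lowering it yields less.

```python
from typing import List


def select_behavior_positions(probs: List[float], rate: float) -> List[int]:
    """Return sorted token indices to label with a spontaneous behavior.

    probs: hypothetical per-token probabilities from a behavior predictor.
    rate:  assumed control knob in [0, 1]; the fraction of tokens to label.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    # Number of positions to label at the requested frequency.
    k = round(rate * len(probs))
    # Rank token indices by predicted probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep the k most likely positions and restore text order.
    return sorted(ranked[:k])


# Illustrative probabilities only; not taken from the paper.
probs = [0.05, 0.80, 0.10, 0.60, 0.30]
less_disfluent = select_behavior_positions(probs, 0.2)
more_disfluent = select_behavior_positions(probs, 0.6)
print(less_disfluent, more_disfluent)
```

Selecting the top-ranked positions keeps the labels where the predictor is most confident, so a lower rate always produces a subset of the positions chosen at a higher rate.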


Pages: 4658-4662

Page count: 5
