Interactive Text-to-Speech System via Joint Style Analysis

Cited by: 4

Authors:

Gao, Yang [1,2]

Zheng, Weiyi [2]

Yang, Zhaojun [2]

Koehler, Thilo [2]

Fuegen, Christian [2]

He, Qing [2]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] Facebook AI, Menlo Pk, CA 94025 USA

Source:

INTERSPEECH 2020 | 2020

Keywords:

Text-to-speech synthesis; emotion; style; semisupervised

DOI:

10.21437/Interspeech.2020-3069

Chinese Library Classification (CLC) codes:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

While modern TTS technologies have made significant advances in audio quality, they still lack behavioral naturalness compared with conversing with people. We propose a style-embedded TTS system that generates styled responses based on the style of the speech query. To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which the TTS then uses to produce a matching response. We faced two main challenges: 1) only a small portion of the TTS training dataset has style labels, which are needed to train a multi-style TTS that respects different style embeddings during inference; 2) the TTS system and the style extraction model have disjoint training datasets, so we need consistent style labels across the two datasets for the TTS to learn to respect the labels produced by the style extraction model during inference. To solve these, we adopted a semi-supervised approach that uses the style extraction model to create style labels for the TTS dataset, and applied transfer learning to learn the style embedding jointly. Our experimental results show user preference for the styled TTS responses and demonstrate the style-embedded TTS system's capability of mimicking the speech query style.


Pages: 4447-4451

Page count: 5
