Interactive Text-to-Speech System via Joint Style Analysis

Cited by: 4
Authors
Gao, Yang [1, 2]
Zheng, Weiyi [2 ]
Yang, Zhaojun [2 ]
Koehler, Thilo [2 ]
Fuegen, Christian [2 ]
He, Qing [2 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Facebook AI, Menlo Pk, CA 94025 USA
Source
Keywords
Text-to-speech synthesis; emotion; style; semi-supervised
DOI
10.21437/Interspeech.2020-3069
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
While modern TTS technologies have made significant advances in audio quality, they still lack behavioral naturalness compared with conversing with people. We propose a style-embedded TTS system that generates styled responses based on the style of the speech query. To achieve this, the system includes a style extraction model that extracts a style embedding from the speech query, which the TTS then uses to produce a matching response. We faced two main challenges: 1) only a small portion of the TTS training dataset has style labels, which are needed to train a multi-style TTS that respects different style embeddings during inference; 2) the TTS system and the style extraction model have disjoint training datasets, so we need consistent style labels across the two datasets for the TTS to learn to respect the labels produced by the style extraction model during inference. To solve these, we adopted a semi-supervised approach that uses the style extraction model to create style labels for the TTS dataset, and applied transfer learning to learn the style embedding jointly. Our experimental results show user preference for the styled TTS responses and demonstrate the style-embedded TTS system's capability of mimicking the speech query style.
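The semi-supervised labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-dimensional toy "style embedding", and the choice to keep existing labels and fill in only the missing ones are all assumptions made for illustration. The abstract states only that the style extraction model creates style labels for the TTS dataset.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Embedding = Tuple[float, ...]


def pseudo_label(
    utterances: Sequence[Sequence[float]],
    labels: Sequence[Optional[Embedding]],
    style_extractor: Callable[[Sequence[float]], Embedding],
) -> List[Embedding]:
    """Return one style label per utterance.

    Assumption: existing labels are kept as-is and only missing ones
    (None) are filled with the style extractor's output.
    """
    filled: List[Embedding] = []
    for audio, label in zip(utterances, labels):
        filled.append(label if label is not None else style_extractor(audio))
    return filled


def toy_style_extractor(audio: Sequence[float]) -> Embedding:
    # Hypothetical stand-in for a trained style extraction model:
    # (mean absolute amplitude, peak amplitude) as a 2-D "embedding".
    mags = [abs(x) for x in audio]
    return (sum(mags) / len(mags), max(mags))


# Toy TTS dataset: three "utterances", only the first has a style label.
utterances = [[0.1, -0.1, 0.1], [0.5, -1.0, 0.5], [0.2, 0.2, -0.2]]
labels = [(0.1, 0.1), None, None]

train_labels = pseudo_label(utterances, labels, toy_style_extractor)
```

In this sketch every utterance ends up with a label in the same embedding space, which is the property the abstract asks for: the TTS model is trained on labels of the kind the style extraction model will produce at inference time.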
Pages: 4447-4451
Page count: 5