VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Authors
Gudmalwar, Ashishkumar [1]
Shah, Nirmesh [1]
Akarsh, Sai [1]
Wasnik, Pankaj [1]
Shah, Rajiv Ratn [2]
Affiliations
[1] Sony Res India Pvt Ltd, Bangalore, Karnataka, India
[2] Indraprastha Inst Informat Technol IIIT, Delhi, India
Source
INTERSPEECH 2024
Keywords
Cross-lingual TTS; emotion; voice cloning
DOI
10.21437/Interspeech.2024-1672
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to enhance the quality of synthesized speech further. The proposed system achieved an average relative improvement of 8.83% compared to the state-of-the-art (SOTA) methods on a database comprising English and three Indian languages (Hindi, Telugu, and Marathi).
Pages: 3000-3004
Page count: 5