VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Cited by: 0
Authors
Gudmalwar, Ashishkumar [1 ]
Shah, Nirmesh [1 ]
Akarsh, Sai [1 ]
Wasnik, Pankaj [1 ]
Shah, Rajiv Ratn [2 ]
Affiliations
[1] Sony Res India Pvt Ltd, Bangalore, Karnataka, India
[2] Indraprastha Inst Informat Technol IIIT, Delhi, India
Source
INTERSPEECH 2024
Keywords
Cross-lingual TTS; emotion; voice cloning
DOI
10.21437/Interspeech.2024-1672
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Despite significant advances in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task requires extracting the voice identity and emotional style from a reference speech in a source language and then transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice identity and Emotional style Controllable Cross-Lingual (VECL) TTS system that uses multilingual speaker and emotion embedding networks. Moreover, we introduce content and style consistency losses to further enhance the quality of the synthesized speech. The proposed system achieved an average relative improvement of 8.83% over state-of-the-art (SOTA) methods on a database comprising English and three Indian languages (Hindi, Telugu, and Marathi).
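The abstract attributes part of the quality gain to content and style consistency losses added on top of the cross-lingual TTS objective. Below is a minimal illustrative sketch in Python/PyTorch of how such auxiliary losses could be formulated; the StyleEncoder module, the tensor shapes, and the cosine/L1 formulations are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    """Placeholder style (speaker + emotion) embedding network (assumed)."""
    def __init__(self, feat_dim: int = 80, emb_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # Mean-pool mel frames over time, then project to a fixed-size style embedding.
        return self.proj(mel.mean(dim=1))

def style_consistency_loss(ref_mel, syn_mel, style_encoder):
    # Pull the style embedding of the synthesized speech toward that of the
    # reference speech; 1 - cosine similarity is 0 when the embeddings align.
    ref_emb = style_encoder(ref_mel)
    syn_emb = style_encoder(syn_mel)
    return (1.0 - F.cosine_similarity(ref_emb, syn_emb, dim=-1)).mean()

def content_consistency_loss(text_feats, syn_content_feats):
    # Keep the linguistic content of the synthesized speech close to the
    # target-language text representation; L1 distance used as a simple proxy.
    return F.l1_loss(syn_content_feats, text_feats)

if __name__ == "__main__":
    enc = StyleEncoder()
    ref_mel = torch.randn(2, 200, 80)   # reference mel-spectrogram (batch, frames, bins)
    syn_mel = torch.randn(2, 180, 80)   # synthesized mel in the target language
    txt = torch.randn(2, 50, 256)       # target-language text/content features
    syn_txt = torch.randn(2, 50, 256)   # content features re-extracted from the synthesis
    total_aux = style_consistency_loss(ref_mel, syn_mel, enc) + content_consistency_loss(txt, syn_txt)
    print(float(total_aux))

In training, such auxiliary terms would typically be weighted and added to the main TTS reconstruction loss; the weights would be hyperparameters, not values reported in the paper.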
Pages: 3000-3004
Number of pages: 5
Related Papers
50 records in total
  • [1] Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
    Yamamoto, Ryuichi
    Shirahata, Yuma
    Kawamura, Masaya
    Tachibana, Kentaro
    arXiv
  • [2] Cross-Lingual Text-to-Speech via Hierarchical Style Transfer
    Lee, Sang-Hoon
    Choi, Ha-Yeong
    Lee, Seong-Whan
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 25 - 26
  • [3] DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech
    Liu, Sen
    Guo, Yiwei
    Du, Chenpeng
    Chen, Xie
    Yu, Kai
    INTERSPEECH 2023, 2023, : 616 - 620
  • [4] GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech
    Cong, Yahuan
    Zhang, Haoyu
    Lin, Haopeng
    Liu, Shichao
    Wang, Chunfeng
    Ren, Yi
    Yin, Xiang
    Ma, Zejun
    INTERSPEECH 2023, 2023, : 5486 - 5490
  • [5] METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer
    Zhu, Xinfa
    Lei, Yi
    Li, Tao
    Zhang, Yongmao
    Zhou, Hongbin
    Lu, Heng
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1506 - 1518
  • [6] X-E-Speech: Joint Training Framework of Non-Autoregressive Cross-lingual Emotional Text-to-Speech and Voice Conversion
    Guo, Houjian
    Liu, Chaoran
    Ishi, Carlos Toshinori
    Ishiguro, Hiroshi
    INTERSPEECH 2024, 2024, : 4983 - 4987
  • [7] Hola-TTS: A Cross-Lingual Zero-Shot Text-to-Speech System for Chinese, English, Japanese, and Korean
    Ding, Hongwu
    Zhou, Yiquan
    Wang, Wenyu
    Xu, JiaCheng
    Mei, Jiaqi
    2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024, 2024, : 601 - 605
  • [8] DiCLET-TTS: Diffusion Model Based Cross-Lingual Emotion Transfer for Text-to-Speech - A Study Between English and Mandarin
    Li T.
    Hu C.
    Cong J.
    Zhu X.
    Li J.
    Tian Q.
    Wang Y.
    Xie L.
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 3418 - 3430
  • [9] Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech
    Zhan, Haoyue
    Yu, Xinyuan
    Zhang, Haitong
    Zhang, Yang
    Lin, Yue
    INTERSPEECH 2022, 2022, : 4247 - 4251