StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis

Authors
Chen, Zhiyong [1]
Li, Xinnuo [1]
Ai, Zhiqi [1]
Xu, Shugong [1]
Affiliation
[1] Shanghai University, School of Communication and Information Engineering, Shanghai, People's Republic of China
Source
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI | 2025, Vol. 15041
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Text-to-speech synthesis; Voice cloning; Zero-shot learning; Multimodal learning;
DOI
10.1007/978-981-97-8795-1_18
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce StyleFusion-TTS, a zero-shot text-to-speech (TTS) synthesis system with style and speaker control that can be driven by a text prompt, an audio reference, or both, designed to improve on the editability and naturalness of systems in the current research literature. We propose a general front-end encoder, a compact and effective module that takes multimodal inputs (text prompts, audio references, and speaker timbre references) in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also uses a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming to achieve optimal feature fusion within the current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple subjective and objective metrics. The system shows promising performance across our evaluations, suggesting its potential to contribute to zero-shot text-to-speech synthesis. A project website provides detailed information for demonstration and reproduction.
Pages: 263-277
Number of pages: 15