StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis

Cited by: 0

Authors:

Chen, Zhiyong [1]

Li, Xinnuo [1]

Ai, Zhiqi [1]

Xu, Shugong [1]

Affiliations:

[1] Shanghai Univ, Sch Commun & Informat Engn, Shanghai, Peoples R China

Source:

PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI | 2025 / Volume 15041

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China;

Keywords:

Text-to-speech synthesis; Voice cloning; Zero-shot learning; Multimodal learning;

DOI:

10.1007/978-981-97-8795-1_18

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We introduce StyleFusion-TTS, a zero-shot text-to-speech (TTS) synthesis system with controllable style and speaker identity, conditioned on a text prompt, an audio reference, or both, and designed to improve on the editability and naturalness of systems in the current literature. We propose a general front-end encoder, a compact and effective module that uses multimodal inputs (text prompts, audio references, and speaker timbre references) in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also uses a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming for optimal feature fusion within a current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple subjective and objective metrics. The system shows promising performance across our evaluations, suggesting its potential to advance zero-shot text-to-speech synthesis. A project website provides detailed information for demonstration and reproduction.


Pages: 263-277

Page count: 15

