Expressive Text-to-Speech with Contextual Background for ICAGC 2024

Times Cited: 0
Authors
Jiang, Yu [1 ]
Wang, Tianrui [1 ]
Wang, Haoyu [1 ]
Gong, Cheng [1 ]
Liu, Qiuyu [1 ]
Huang, Zikang [1 ]
Wang, Longbiao [1 ]
Dang, Jianwu [1 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China
Source
2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024 | 2024
Keywords
TTS; TTA; Large Language Model; Speech Fusion Generation; ICAGC 2024
DOI
10.1109/ISCSLP63861.2024.10800495
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we describe a speech synthesis system that generates speech with specified emotions and background sounds, developed for Track 2 of the ICAGC 2024 challenge at ISCSLP 2024. The system combines two advanced audio generation models, GPT-SoVITS and AudioLDM 2, to produce emotional speech over a specified background. GPT-SoVITS clones the timbre and emotion of the target speaker's voice, while AudioLDM 2 generates background audio conditioned on the textual content. Emotional speech with the desired background is then obtained by combining the outputs of the two models. The official evaluation scored the generated results on three aspects: speaker similarity, the convincingness of the match between background audio and speech, and the degree of emotional inspiration. Our method achieves a speaker similarity score of 3.33 and an emotional inspiration score of 3.33; for convincing matching, it scores 2.66. In subjective MOS listening tests, the overall score averages 3.06 with a standard deviation of 0.41. Overall, our system secured 2nd place.
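The abstract describes a two-branch pipeline: GPT-SoVITS clones the target speaker's emotional voice, AudioLDM 2 synthesizes background audio from text, and the two waveforms are mixed. Below is a minimal sketch of that fusion step, assuming the Hugging Face diffusers AudioLDM2Pipeline for the background branch; `clone_speech`, the shared 16 kHz sample rate, and the 0.3 background gain are illustrative assumptions, not the authors' reported implementation (GPT-SoVITS exposes inference through its own scripts and HTTP API rather than a pip package).

```python
# Sketch of the speech/background fusion described in the abstract, not the
# authors' released code. The AudioLDM 2 branch uses the real diffusers
# pipeline; `clone_speech` is a hypothetical placeholder for a GPT-SoVITS
# inference call returning a mono 16 kHz waveform.
import numpy as np
import torch
from diffusers import AudioLDM2Pipeline


def generate_background(prompt: str, seconds: float = 10.0) -> np.ndarray:
    """Generate 16 kHz background audio from a text prompt with AudioLDM 2."""
    pipe = AudioLDM2Pipeline.from_pretrained(
        "cvssp/audioldm2", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(prompt, num_inference_steps=200, audio_length_in_s=seconds).audios[0]


def mix(speech: np.ndarray, background: np.ndarray, bg_gain: float = 0.3) -> np.ndarray:
    """Overlay background under speech, padding/trimming it to the speech length."""
    if background.shape[0] < speech.shape[0]:
        background = np.pad(background, (0, speech.shape[0] - background.shape[0]))
    mixed = speech + bg_gain * background[: speech.shape[0]]
    peak = np.abs(mixed).max()
    return mixed / peak if peak > 1.0 else mixed  # normalize only if clipping


# speech = clone_speech(text, reference_wav)  # hypothetical GPT-SoVITS call
# out = mix(speech, generate_background("rain tapping on a cafe window"))
```

Both branches must share a sample rate before mixing, and the balance between speech and background presumably drives the challenge's "convincing matching" score, so the gain here would need tuning in practice.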
Pages: 611-615
Page Count: 5