Expressive Text-to-Speech with Contextual Background for ICAGC 2024

Cited by: 0
Authors
Jiang, Yu [1]
Wang, Tianrui [1]
Wang, Haoyu [1]
Gong, Cheng [1]
Liu, Qiuyu [1]
Huang, Zikang [1]
Wang, Longbiao [1]
Dang, Jianwu [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China
Source
2024 IEEE 14TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, ISCSLP 2024 | 2024
Keywords
TTS; TTA; Large Language Model; Speech Fusion Generation; ICAGC; 2024; MODEL;
DOI
10.1109/ISCSLP63861.2024.10800495
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we describe a speech synthesis system that generates speech with specified emotions and background sounds, and we apply it to Track 2 of ICAGC 2024 at ISCSLP. The system combines two audio generation models, GPT-SoVITS and AudioLDM 2, to generate emotional speech with a specific background. GPT-SoVITS is used to clone the timbre and emotion of the target speaker's voice. AudioLDM 2 is employed to generate background audio according to the textual content. Emotional speech with specific background audio is then produced by combining the outputs of the two models. The official evaluation of the generated results focused on three aspects: speaker similarity, audio and speech convincing matching degree, and emotional inspiration degree. Our method achieves a speaker similarity score of 3.33 and an emotional inspiration score of 3.33; for convincing matching, our score is 2.66. In subjective MOS listening tests, the overall score averages 3.06 with a standard deviation of 0.41. Overall, our system secured 2nd place.
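The abstract does not say how the two model outputs are combined. As an illustration only, the sketch below assumes the simplest option: the background is looped or trimmed to the length of the speech, scaled to a chosen speech-to-background ratio, and added sample by sample. The function names, the 10 dB default, and the peak-normalization step are all hypothetical and are not taken from the paper.

```python
import math


def rms(samples):
    """Root-mean-square level of a mono signal given as a list of floats."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def fit_length(background, n):
    """Loop or trim the background so that it has exactly n samples."""
    if not background:
        return [0.0] * n
    reps = -(-n // len(background))  # ceiling division
    return (background * reps)[:n]


def mix_at_snr(speech, background, snr_db=10.0):
    """Add background under speech at a target speech-to-background ratio.

    Assumption: both inputs are mono, share a sample rate, and use
    float samples in [-1.0, 1.0]. This is not the authors' method.
    """
    bg = fit_length(background, len(speech))
    speech_rms, bg_rms = rms(speech), rms(bg)
    if speech_rms == 0.0 or bg_rms == 0.0:
        gain = 1.0  # one signal is silent, so there is no ratio to enforce
    else:
        gain = speech_rms / (bg_rms * 10 ** (snr_db / 20))
    mixed = [s + gain * b for s, b in zip(speech, bg)]
    # Scale down only if the sum would clip.
    peak = max((abs(x) for x in mixed), default=0.0)
    if peak > 1.0:
        mixed = [x / peak for x in mixed]
    return mixed
```

In such a setup, `speech` would hold the GPT-SoVITS waveform and `background` the AudioLDM 2 waveform after both have been resampled to a common rate. A higher `snr_db` keeps the background quieter relative to the voice.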
Pages: 611-615
Page count: 5