Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Cited by: 6
Authors
Chen, Sanyuan [1 ]
Wang, Chengyi [2 ]
Wu, Yu [4 ]
Zhang, Ziqiang [3 ]
Zhou, Long [4 ]
Liu, Shujie [4 ]
Chen, Zhuo [4 ]
Liu, Yanqing [4 ]
Wang, Huaming [4 ]
Li, Jinyu [4 ]
He, Lei [4 ]
Zhao, Sheng [4 ]
Wei, Furu [4 ]
Affiliations
[1] Harbin Inst Technol, Comp Sci & Technol, Harbin 150001, Peoples R China
[2] Nankai Univ, Tianjin 300071, Peoples R China
[3] Univ Sci & Technol China, Hefei 230026, Peoples R China
[4] Microsoft Corp, Redmond, WA 98052 USA
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2025, Vol. 33
Keywords
Codes; Codecs; Data models; Acoustics; Speech coding; Training data; Recording; Speech recognition; Decoding; Vocoders; Zero-shot text to speech synthesis; speech generation; voice cloning; language modeling; pre-training; in-context learning
DOI
10.1109/TASLPRO.2025.3530270
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) on discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 50k hours of English speech, hundreds of times larger than existing systems. VALL-E exhibits in-context learning capability and can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the prompt in synthesis.
Pages: 705-718
Number of Pages: 14
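To make the abstract's central idea concrete, below is a minimal illustrative sketch of TTS framed as conditional language modeling over discrete codec tokens: an autoregressive Transformer predicts the next audio-codec code given phoneme tokens and the codec tokens seen so far (at inference, the prompt's codes would seed the acoustic context). This is not the authors' VALL-E implementation: it collapses VALL-E's two-stage design (autoregressive for the first codec quantizer, non-autoregressive for the rest) into a single-codebook toy, omits positional encodings, and all names and sizes (CodecLM, num_codes, d_model, etc.) are assumptions for illustration only.

# Toy sketch of conditional codec language modeling (NOT the paper's code).
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    def __init__(self, num_phonemes=100, num_codes=1024, d_model=256,
                 nhead=4, num_layers=4):
        super().__init__()
        self.phone_emb = nn.Embedding(num_phonemes, d_model)
        self.code_emb = nn.Embedding(num_codes + 1, d_model)  # +1 slot for BOS
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_codes)
        self.bos = num_codes

    def forward(self, phonemes, codes):
        # phonemes: (B, Tp) text condition; codes: (B, Tc) codec tokens
        # (prompt codes followed by target codes during training).
        bos = torch.full((codes.size(0), 1), self.bos,
                         dtype=torch.long, device=codes.device)
        seq = torch.cat([self.phone_emb(phonemes),
                         self.code_emb(torch.cat([bos, codes], dim=1))], dim=1)
        # Causal mask: each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(
            seq.size(1)).to(seq.device)
        hidden = self.backbone(seq, mask=mask)
        # Keep only acoustic positions; each predicts the next codec token.
        return self.head(hidden[:, phonemes.size(1):, :])

# Toy usage: next-token cross-entropy over codec codes.
model = CodecLM()
phonemes = torch.randint(0, 100, (2, 12))   # dummy phoneme ids
codes = torch.randint(0, 1024, (2, 40))     # dummy codec token ids
logits = model(phonemes, codes)             # (2, 41, 1024)
# Position i predicts code i+1, so drop the final position.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 1024), codes.reshape(-1))

Framing synthesis this way is what lets a short enrolled recording act as an in-context prompt: its codec tokens are simply prepended to the acoustic context, and decoding continues in the same code space.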