Leveraging Low-Rank Adaptation for Parameter-Efficient Fine-Tuning in Multi-Speaker Adaptive Text-to-Speech Synthesis

Cited by: 0
Authors
Hong, Changi [1 ]
Lee, Jung Hyuk [2 ]
Kim, Hong Kook [1 ,2 ,3 ]
Affiliations
[1] Gwangju Inst Sci & Technol, AI Grad Sch, Gwangju 61005, South Korea
[2] Gwangju Inst Sci & Technol, Sch Elect Engn & Comp Sci, Gwangju 61005, South Korea
[3] AunionAI Co Ltd, Gwangju 61005, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Adaptation models; Predictive models; Computational modeling; Acoustics; Training; Data models; Tuning; Text to speech; Load modeling; Zero shot learning; Text-to-speech synthesis; low-rank adaptation; multi-speaker adaptation; parameter-efficient fine-tuning; residual adapter; conditional layer normalization; variational inference with adversarial learning
DOI
10.1109/ACCESS.2024.3515206
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Text-to-speech (TTS) technology is commonly used to generate personalized voices for new speakers. Despite considerable progress in TTS technology, synthesizing high-quality custom voices remains difficult. Fine-tuning a TTS model is a popular way to address this issue; however, fine-tuning must be repeated for every new speaker, which makes model training time-consuming and requires excessive storage for the TTS model parameters. Therefore, to support a large number of new speakers, a parameter-efficient fine-tuning (PEFT) approach should replace full fine-tuning, together with an approach that accommodates multiple speakers with a small number of parameters. To this end, this work first incorporates a low-rank adaptation-based fine-tuning method into the variational inference with adversarial learning for end-to-end text-to-speech (VITS) model. Next, the approach is extended with conditional layer normalization for multi-speaker fine-tuning, and a residual adapter is further applied to the text encoder outputs of the VITS model to improve the intelligibility and naturalness of the personalized speech. The performance of the fine-tuned TTS models with different combinations of fine-tuning modules is evaluated on the Libri-TTS-100, VCTK, and Common Voice datasets, as well as on a Korean multi-speaker dataset. Objective and subjective quality comparisons reveal that the proposed approach achieves speech quality comparable to that of a fully fine-tuned model, with around a 90% reduction in the number of model parameters.
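As a rough illustration of where the parameter saving in low-rank adaptation comes from, the sketch below compares trainable-parameter counts for fully fine-tuning one linear layer against training only a rank-r LoRA update W + (alpha/r)·BA on top of a frozen W. The layer dimensions, rank, and scaling factor here are hypothetical choices for illustration, not the paper's actual VITS configuration, and the exact overall saving depends on which modules of the model are adapted.

```python
import numpy as np

def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Trainable parameters: full fine-tuning vs. a rank-r LoRA update."""
    full = d_in * d_out           # update the entire weight matrix W
    lora = rank * (d_in + d_out)  # train only A (rank x d_in) and B (d_out x rank)
    return full, lora

def lora_forward(x, W, A, B, alpha: float = 8.0):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

full, lora = lora_param_counts(256, 256, 4)
print(full, lora, 1 - lora / full)  # 65536 2048 0.96875
```

With B initialized to zero, the adapted layer starts out identical to the frozen one, so fine-tuning begins from the pretrained model's behavior; the trainable-parameter count grows only linearly in the rank rather than quadratically in the layer width.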
Pages: 190711-190727
Page count: 17