Leveraging Low-Rank Adaptation for Parameter-Efficient Fine-Tuning in Multi-Speaker Adaptive Text-to-Speech Synthesis

Cited by: 0
|
Authors
Hong, Changi [1 ]
Lee, Jung Hyuk [2 ]
Kim, Hong Kook [1 ,2 ,3 ]
Affiliations
[1] Gwangju Inst Sci & Technol, AI Grad Sch, Gwangju 61005, South Korea
[2] Gwangju Inst Sci & Technol, Sch Elect Engn & Comp Sci, Gwangju 61005, South Korea
[3] AunionAI Co Ltd, Gwangju 61005, South Korea
Source
IEEE ACCESS | 2024 / Volume 12
Keywords
Adaptation models; Predictive models; Computational modeling; Acoustics; Training; Data models; Tuning; Text to speech; Load modeling; Zero shot learning; Text-to-speech synthesis; low-rank adaptation; multi-speaker adaptation; parameter-efficient fine-tuning; residual adapter; conditional layer normalization; variational inference with adversarial learning;
DOI
10.1109/ACCESS.2024.3515206
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Text-to-speech (TTS) technology is commonly used to generate personalized voices for new speakers. Despite considerable progress in TTS, synthesizing high-quality custom voices remains difficult. Fine-tuning a pretrained TTS model is a popular way to address this issue, but the fine-tuning must be repeated for every new speaker, which makes model training time-consuming and inflates the storage required for the resulting model parameters. Supporting a large number of new speakers therefore calls for a parameter-efficient fine-tuning (PEFT) approach in place of full fine-tuning, together with a mechanism that accommodates multiple speakers with a small number of additional parameters. To this end, this work first incorporates a low-rank adaptation (LoRA)-based fine-tuning method into the variational inference with adversarial learning for end-to-end TTS (VITS) model. The approach is then extended with conditional layer normalization for multi-speaker fine-tuning, and a residual adapter is further applied to the text-encoder outputs of the VITS model to improve the intelligibility and naturalness of the personalized speech. Fine-tuned TTS models with different combinations of these fine-tuning modules are evaluated on the LibriTTS-100, VCTK, and Common Voice datasets, as well as a Korean multi-speaker dataset. Objective and subjective quality comparisons show that the proposed approach achieves speech quality comparable to that of a fully fine-tuned model, with around a 90% reduction in the number of model parameters.
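To make the three fine-tuning modules named in the abstract concrete, the following Python (PyTorch) sketch shows one minimal, hypothetical form of each: a LoRA-wrapped linear layer, conditional layer normalization driven by a speaker embedding, and a residual bottleneck adapter of the kind that could sit on top of text-encoder outputs. All class names, dimensions, and attachment points are illustrative assumptions for exposition, not the authors' actual VITS implementation.

# Hypothetical sketch of the three PEFT modules discussed in the abstract.
# Names, shapes, and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # base weights stay frozen; only the LoRA factors train
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())


class ConditionalLayerNorm(nn.Module):
    """LayerNorm whose scale and bias are predicted from a speaker embedding (multi-speaker conditioning)."""

    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); spk: (batch, speaker_dim), broadcast over time
        scale = self.to_scale(spk).unsqueeze(1)
        bias = self.to_bias(spk).unsqueeze(1)
        return self.norm(x) * (1.0 + scale) + bias


class ResidualAdapter(nn.Module):
    """Small bottleneck MLP added residually, e.g. on top of text-encoder outputs."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

In a PEFT setup along these lines, only the LoRA factors, the CLN projection layers, and the adapter weights would be trained and stored per speaker (or speaker group), while the shared base VITS weights remain frozen, which is what makes the reported parameter reduction relative to full fine-tuning possible.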
Pages: 190711 - 190727
Number of pages: 17