Leveraging Low-Rank Adaptation for Parameter-Efficient Fine-Tuning in Multi-Speaker Adaptive Text-to-Speech Synthesis

Cited by: 0
|
Authors
Hong, Changi [1 ]
Lee, Jung Hyuk [2 ]
Kim, Hong Kook [1 ,2 ,3 ]
Affiliations
[1] Gwangju Inst Sci & Technol, AI Grad Sch, Gwangju 61005, South Korea
[2] Gwangju Inst Sci & Technol, Sch Elect Engn & Comp Sci, Gwangju 61005, South Korea
[3] AunionAI Co Ltd, Gwangju 61005, South Korea
Source
IEEE ACCESS | 2024 / Volume 12
Keywords
Adaptation models; Predictive models; Computational modeling; Acoustics; Training; Data models; Tuning; Text to speech; Load modeling; Zero shot learning; Text-to-speech synthesis; low-rank adaptation; multi-speaker adaptation; parameter-efficient fine-tuning; residual adapter; conditional layer normalization; variational inference with adversarial learning;
DOI
10.1109/ACCESS.2024.3515206
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Text-to-speech (TTS) technology is commonly used to generate personalized voices for new speakers. Despite considerable progress in TTS, synthesizing high-quality custom voices remains difficult. Fine-tuning a pretrained TTS model is a popular way to address this issue, but the fine-tuning must be repeated for every new speaker, which makes model training time-consuming and inflates the storage required for the resulting model parameters. Supporting a large number of new speakers therefore calls for a parameter-efficient fine-tuning (PEFT) approach in place of full fine-tuning, together with a mechanism that accommodates multiple speakers with a small number of additional parameters. To this end, this work first incorporates a low-rank adaptation (LoRA)-based fine-tuning method into the variational inference with adversarial learning for end-to-end TTS (VITS) model. The approach is then extended with conditional layer normalization for multi-speaker fine-tuning, and a residual adapter is further applied to the text-encoder outputs of the VITS model to improve the intelligibility and naturalness of the personalized speech. Fine-tuned TTS models with different combinations of these fine-tuning modules are evaluated on the LibriTTS-100, VCTK, and Common Voice datasets, as well as a Korean multi-speaker dataset. Objective and subjective quality comparisons show that the proposed approach achieves speech quality comparable to that of a fully fine-tuned model, with around a 90% reduction in the number of model parameters.
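To make the three fine-tuning modules named in the abstract concrete, the following Python (PyTorch) sketch shows one minimal, hypothetical form of each: a LoRA-wrapped linear layer, conditional layer normalization driven by a speaker embedding, and a residual bottleneck adapter of the kind that could sit on top of text-encoder outputs. All class names, dimensions, and attachment points are illustrative assumptions for exposition, not the authors' actual VITS implementation.

# Hypothetical sketch of the three PEFT modules discussed in the abstract.
# Names, shapes, and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # base weights stay frozen; only the LoRA factors train
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())


class ConditionalLayerNorm(nn.Module):
    """LayerNorm whose scale and bias are predicted from a speaker embedding (multi-speaker conditioning)."""

    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); spk: (batch, speaker_dim), broadcast over time
        scale = self.to_scale(spk).unsqueeze(1)
        bias = self.to_bias(spk).unsqueeze(1)
        return self.norm(x) * (1.0 + scale) + bias


class ResidualAdapter(nn.Module):
    """Small bottleneck MLP added residually, e.g. on top of text-encoder outputs."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

In a PEFT setup along these lines, only the LoRA factors, the CLN projection layers, and the adapter weights would be trained and stored per speaker (or speaker group), while the shared base VITS weights remain frozen, which is what makes the reported parameter reduction relative to full fine-tuning possible.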
Pages: 190711 - 190727
Number of pages: 17