Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis
Cited by: 1
Authors:
Park, Seongyeon [1]
Kim, Bohyung [1]
Oh, Tae-Hyun [2,3,4]
Affiliations:
[1] CNAI, Seoul, South Korea
[2] Yonsei Univ, Inst Convergence Res & Educ Adv Technol, Seoul, South Korea
[3] POSTECH, Dept EE, Pohang, South Korea
[4] POSTECH, GSAI, Pohang, South Korea
Recently, zero-shot TTS and VC methods have gained attention for their practical ability to generate voices unseen during training. Among these methods, zero-shot modifications of the VITS model have shown superior performance while retaining useful properties inherited from VITS. However, the performance of VITS and VITS-based zero-shot models varies dramatically depending on how the losses are balanced. This can be problematic, as it requires a burdensome procedure of tuning loss-balance hyper-parameters to find the optimal balance. In this work, we propose a novel framework that finds this optimum without search, by inducing the decoder of VITS-based models to reach its full reconstruction ability. With our framework, we show superior performance compared to baselines in zero-shot TTS and VC, achieving state-of-the-art results. Furthermore, we show the robustness of our framework in various settings. We provide an explanation for these results in the discussion.