Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Cited by: 1

Authors:

Wang, Shijun [1]

Borth, Damian [1]

Affiliations:

[1] Univ St Gallen, Sch Comp Sci, AIML Lab, St Gallen, Switzerland

Source:

2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2022

Keywords:

Zero-Shot Voice Conversion; Self-Supervised Learning; Disentanglement Representation Learning;

DOI:

10.1109/IJCNN55064.2022.9892405

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic as it enables a range of applications like voice customizing, animation production, and others. Recent work in this area made progress with disentanglement methods that separate utterance content and speaker characteristics from speech audio recordings. However, many of these methods are subject to the leakage of prosody (e.g., pitch, volume), causing the speaker voice in the synthesized speech to be different from the desired target speakers. To prevent this issue, we propose a novel self-supervised approach that effectively learns disentangled pitch and volume representations that can represent the prosody styles of different speakers. We then use the learned prosodic representations as conditional information to train and enhance our VC model for zero-shot conversion. In our experiments, we show that our prosody representations are disentangled and rich in prosody information. Moreover, we demonstrate that the addition of our prosody representations improves our VC performance and surpasses state-of-the-art zero-shot VC performances.


Pages: 8
