Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Cited by: 1
Authors
Wang, Shijun [1]
Borth, Damian [1]
Affiliations
[1] Univ St Gallen, Sch Comp Sci, AIML Lab, St Gallen, Switzerland
Source
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2022
Keywords
Zero-Shot Voice Conversion; Self-Supervised Learning; Disentanglement Representation Learning
DOI
10.1109/IJCNN55064.2022.9892405
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic as it enables a range of applications like voice customizing, animation production, and others. Recent work in this area made progress with disentanglement methods that separate utterance content and speaker characteristics from speech audio recordings. However, many of these methods are subject to the leakage of prosody (e.g., pitch, volume), causing the speaker voice in the synthesized speech to be different from the desired target speakers. To prevent this issue, we propose a novel self-supervised approach that effectively learns disentangled pitch and volume representations that can represent the prosody styles of different speakers. We then use the learned prosodic representations as conditional information to train and enhance our VC model for zero-shot conversion. In our experiments, we show that our prosody representations are disentangled and rich in prosody information. Moreover, we demonstrate that the addition of our prosody representations improves our VC performance and surpasses state-of-the-art zero-shot VC performances.
Pages: 8