Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Cited by: 1
Authors
Wang, Shijun [1 ]
Borth, Damian [1 ]
Affiliations
[1] Univ St Gallen, Sch Comp Sci, AIML Lab, St Gallen, Switzerland
Source
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2022
Keywords
Zero-Shot Voice Conversion; Self-Supervised Learning; Disentanglement Representation Learning;
DOI
10.1109/IJCNN55064.2022.9892405
Chinese Library Classification (CLC) number
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic because it enables a range of applications such as voice customization and animation production. Recent work in this area has made progress with disentanglement methods that separate utterance content and speaker characteristics from speech recordings. However, many of these methods suffer from prosody leakage (e.g., pitch, volume), causing the voice in the synthesized speech to differ from that of the desired target speaker. To prevent this issue, we propose a novel self-supervised approach that effectively learns disentangled pitch and volume representations capable of capturing the prosody styles of different speakers. We then use the learned prosody representations as conditional information to train and enhance our VC model for zero-shot conversion. In our experiments, we show that our prosody representations are disentangled and rich in prosody information. Moreover, we demonstrate that adding our prosody representations improves our VC performance and surpasses state-of-the-art zero-shot VC systems.
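To illustrate the conditioning idea described in the abstract, below is a minimal sketch (not the authors' implementation) of how frame-level pitch and volume could be encoded into a prosody embedding and used, together with content features and a target-speaker embedding, to condition a VC decoder. All module names, dimensions, and the choice of PyTorch are assumptions made for illustration; the sketch shows only the conditioning interface, not the paper's self-supervised training procedure.

```python
# Illustrative sketch only: prosody-conditioned VC decoding.
# Module names, dimensions, and architecture are assumptions, not the paper's design.
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Encodes per-frame pitch and volume (energy) into a single prosody embedding."""
    def __init__(self, hidden=128, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, pitch, energy):
        # pitch, energy: (batch, frames) -> stacked to (batch, frames, 2)
        x = torch.stack([pitch, energy], dim=-1)
        _, h = self.rnn(x)                       # final hidden state: (1, batch, hidden)
        return self.proj(h.squeeze(0))           # prosody embedding: (batch, emb_dim)

class ConditionedDecoder(nn.Module):
    """Predicts mel frames from content features plus speaker and prosody embeddings."""
    def __init__(self, content_dim=256, spk_dim=64, pros_dim=64, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(content_dim + spk_dim + pros_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_mels),
        )

    def forward(self, content, spk_emb, pros_emb):
        # content: (batch, frames, content_dim); embeddings broadcast to every frame
        frames = content.size(1)
        cond = torch.cat([spk_emb, pros_emb], dim=-1)        # (batch, spk_dim + pros_dim)
        cond = cond.unsqueeze(1).expand(-1, frames, -1)      # (batch, frames, spk_dim + pros_dim)
        return self.net(torch.cat([content, cond], dim=-1))  # (batch, frames, n_mels)

# Toy usage with random tensors standing in for real features.
pitch = torch.rand(4, 100)           # normalized F0 contour
energy = torch.rand(4, 100)          # frame-level volume
content = torch.rand(4, 100, 256)    # features from a content encoder
spk_emb = torch.rand(4, 64)          # target-speaker embedding
pros = ProsodyEncoder()(pitch, energy)
mel = ConditionedDecoder()(content, spk_emb, pros)
print(mel.shape)                     # torch.Size([4, 100, 80])
```

In this sketch the prosody embedding is simply concatenated with the speaker embedding before decoding; in the paper, the prosody representations are learned self-supervised so that pitch and volume are disentangled before being used as conditioning.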
Pages: 8
Related Papers
50 records in total
  • [1] TRAINING ROBUST ZERO-SHOT VOICE CONVERSION MODELS WITH SELF-SUPERVISED FEATURES
    Trung Dang
    Dung Tran
    Chin, Peter
    Koishida, Kazuhito
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6557 - 6561
  • [2] Two-stage and Self-supervised Voice Conversion for Zero-Shot Dysarthric Speech Reconstruction
    Liu, Dong
    Lin, Yueqian
    Bu, Hui
    Li, Ming
    2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024, 2024, : 423 - 427
  • [3] Zero-Shot Text Classification via Self-Supervised Tuning
    Liu, Chaoqun
    Zhang, Wenxuan
    Chen, Guizhen
    Wu, Xiaobao
    Luu, Anh Tuan
    Chang, Chip Hong
    Bing, Lidong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 1743 - 1761
  • [4] Self-Supervised Knowledge Triplet Learning for Zero-Shot Question Answering
    Banerjee, Pratyay
    Baral, Chitta
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 151 - 162
  • [5] ROBUST DISENTANGLED VARIATIONAL SPEECH REPRESENTATION LEARNING FOR ZERO-SHOT VOICE CONVERSION
    Lian, Jiachen
    Zhang, Chunlei
    Yu, Dong
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6572 - 6576
  • [6] Self-Supervised Remote Sensing Image Dehazing Network Based on Zero-Shot Learning
    Wei, Jianchong
    Cao, Yan
    Yang, Kunping
    Chen, Liang
    Wu, Yi
    REMOTE SENSING, 2023, 15 (11)
  • [7] Information Retrieval from Alternative Data using Zero-Shot Self-Supervised Learning
    Assareh, Amin
    2022 IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE FOR FINANCIAL ENGINEERING AND ECONOMICS (CIFER), 2022,
  • [8] Self-supervised embedding for generalized zero-shot learning in remote sensing scene classification
    Damalla, Rambabu
    Datla, Rajeshreddy
    Vishnu, Chalavadi
    Mohan, Chalavadi Krishna
    JOURNAL OF APPLIED REMOTE SENSING, 2023, 17 (03)
  • [9] Prototype-Augmented Self-Supervised Generative Network for Generalized Zero-Shot Learning
    Wu, Jiamin
    Zhang, Tianzhu
    Zha, Zheng-Jun
    Luo, Jiebo
    Zhang, Yongdong
    Wu, Feng
    IEEE Transactions on Image Processing, 2024, 33 : 1938 - 1951