Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Cited by: 1
Authors
Wang, Shijun [1 ]
Borth, Damian [1 ]
Affiliations
[1] Univ St Gallen, Sch Comp Sci, AIML Lab, St Gallen, Switzerland
Source
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2022
Keywords
Zero-Shot Voice Conversion; Self-Supervised Learning; Disentanglement Representation Learning;
DOI
10.1109/IJCNN55064.2022.9892405
Chinese Library Classification (CLC) number
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic because it enables a range of applications such as voice customization and animation production. Recent work in this area has made progress with disentanglement methods that separate utterance content and speaker characteristics from speech recordings. However, many of these methods suffer from prosody leakage (e.g., pitch, volume), causing the voice in the synthesized speech to differ from that of the desired target speaker. To prevent this issue, we propose a novel self-supervised approach that effectively learns disentangled pitch and volume representations capable of capturing the prosody styles of different speakers. We then use the learned prosody representations as conditional information to train and enhance our VC model for zero-shot conversion. In our experiments, we show that our prosody representations are disentangled and rich in prosody information. Moreover, we demonstrate that adding our prosody representations improves our VC performance and surpasses state-of-the-art zero-shot VC systems.
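To illustrate the conditioning idea described in the abstract, below is a minimal sketch (not the authors' implementation) of how frame-level pitch and volume could be encoded into a prosody embedding and used, together with content features and a target-speaker embedding, to condition a VC decoder. All module names, dimensions, and the choice of PyTorch are assumptions made for illustration; the sketch shows only the conditioning interface, not the paper's self-supervised training procedure.

```python
# Illustrative sketch only: prosody-conditioned VC decoding.
# Module names, dimensions, and architecture are assumptions, not the paper's design.
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Encodes per-frame pitch and volume (energy) into a single prosody embedding."""
    def __init__(self, hidden=128, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, pitch, energy):
        # pitch, energy: (batch, frames) -> stacked to (batch, frames, 2)
        x = torch.stack([pitch, energy], dim=-1)
        _, h = self.rnn(x)                       # final hidden state: (1, batch, hidden)
        return self.proj(h.squeeze(0))           # prosody embedding: (batch, emb_dim)

class ConditionedDecoder(nn.Module):
    """Predicts mel frames from content features plus speaker and prosody embeddings."""
    def __init__(self, content_dim=256, spk_dim=64, pros_dim=64, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(content_dim + spk_dim + pros_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_mels),
        )

    def forward(self, content, spk_emb, pros_emb):
        # content: (batch, frames, content_dim); embeddings broadcast to every frame
        frames = content.size(1)
        cond = torch.cat([spk_emb, pros_emb], dim=-1)        # (batch, spk_dim + pros_dim)
        cond = cond.unsqueeze(1).expand(-1, frames, -1)      # (batch, frames, spk_dim + pros_dim)
        return self.net(torch.cat([content, cond], dim=-1))  # (batch, frames, n_mels)

# Toy usage with random tensors standing in for real features.
pitch = torch.rand(4, 100)           # normalized F0 contour
energy = torch.rand(4, 100)          # frame-level volume
content = torch.rand(4, 100, 256)    # features from a content encoder
spk_emb = torch.rand(4, 64)          # target-speaker embedding
pros = ProsodyEncoder()(pitch, energy)
mel = ConditionedDecoder()(content, spk_emb, pros)
print(mel.shape)                     # torch.Size([4, 100, 80])
```

In this sketch the prosody embedding is simply concatenated with the speaker embedding before decoding; in the paper, the prosody representations are learned self-supervised so that pitch and volume are disentangled before being used as conditioning.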
Pages: 8
Related Papers
50 records in total
  • [1] TRAINING ROBUST ZERO-SHOT VOICE CONVERSION MODELS WITH SELF-SUPERVISED FEATURES
    Trung Dang
    Dung Tran
    Chin, Peter
    Koishida, Kazuhito
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6557 - 6561
  • [2] Two-stage and Self-supervised Voice Conversion for Zero-Shot Dysarthric Speech Reconstruction
    Liu, Dong
    Lin, Yueqian
    Bu, Hui
    Li, Ming
    2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024, 2024, : 423 - 427
  • [3] Zero-Shot Text Classification via Self-Supervised Tuning
    Liu, Chaoqun
    Zhang, Wenxuan
    Chen, Guizhen
    Wu, Xiaobao
    Luu, Anh Tuan
    Chang, Chip Hong
    Bing, Lidong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 1743 - 1761
  • [4] Self-Supervised Knowledge Triplet Learning for Zero-Shot Question Answering
    Banerjee, Pratyay
    Baral, Chitta
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 151 - 162
  • [5] ROBUST DISENTANGLED VARIATIONAL SPEECH REPRESENTATION LEARNING FOR ZERO-SHOT VOICE CONVERSION
    Lian, Jiachen
    Zhang, Chunlei
    Yu, Dong
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6572 - 6576
  • [6] Self-Supervised Remote Sensing Image Dehazing Network Based on Zero-Shot Learning
    Wei, Jianchong
    Cao, Yan
    Yang, Kunping
    Chen, Liang
    Wu, Yi
    REMOTE SENSING, 2023, 15 (11)
  • [7] Information Retrieval from Alternative Data using Zero-Shot Self-Supervised Learning
    Assareh, Amin
    2022 IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE FOR FINANCIAL ENGINEERING AND ECONOMICS (CIFER), 2022,
  • [8] Self-supervised embedding for generalized zero-shot learning in remote sensing scene classification
    Damalla, Rambabu
    Datla, Rajeshreddy
    Vishnu, Chalavadi
    Mohan, Chalavadi Krishna
    JOURNAL OF APPLIED REMOTE SENSING, 2023, 17 (03)
  • [9] Prototype-Augmented Self-Supervised Generative Network for Generalized Zero-Shot Learning
    Wu, Jiamin
    Zhang, Tianzhu
    Zha, Zheng-Jun
    Luo, Jiebo
    Zhang, Yongdong
    Wu, Feng
    IEEE Transactions on Image Processing, 2024, 33 : 1938 - 1951