Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion

Cited by: 6
Authors
Kang, Xiao [1]
Huang, Hao [1, 2]
Hu, Ying [1]
Huang, Zhihua [1]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Xinjiang Prov Key Lab Multilingual Informat Techn, Urumqi, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
Voice conversion; Zero-shot; VQ-VAE; Connectionist temporal classification; NEURAL-NETWORKS;
DOI
10.1016/j.dsp.2021.103110
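Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809
Abstract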
The vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method for non-parallel zero-shot voice conversion (VC). The reason is that VQ-VAE can disentangle the content and speaker representations of speech by using a content encoder and a speaker encoder, which suits the VC task of making the speech of a source speaker sound like that of a target speaker without changing the linguistic content. However, the converted speech is unsatisfactory because, without linguistic supervision for the content encoder, it is difficult to disentangle pure content representations from the acoustic features. To address this issue, a connectionist temporal classification (CTC) loss, applied through an auxiliary network, is proposed within the VQ-VAE framework to guide the content encoder to learn pure content representations. Because the CTC loss is not affected by the sequence length of the content encoder's output, adding linguistic supervision to the content encoder becomes much easier. This non-parallel many-to-many voice conversion model is named CTC-VQ-VAE. VC experiments on the CMU ARCTIC and VCTK corpora are carried out to evaluate the proposed method. Both the objective and the subjective results show that the proposed approach significantly improves the speech quality and speaker similarity of the converted speech compared with the traditional VQ-VAE method. © 2021 Elsevier Inc. All rights reserved.
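The abstract's point that the CTC loss does not depend on the length of the encoder output can be illustrated with a minimal sketch of the standard CTC forward algorithm. This is not the authors' implementation: the function names, the choice of index 0 as the blank symbol, and the toy inputs are all assumptions made for illustration. In the setting the abstract describes, the per-frame log-probabilities would presumably come from the auxiliary network placed on top of the content encoder.

```python
import math


def log_sum_exp(*values):
    """Numerically stable log(sum(exp(v))) for log-domain accumulation."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """Negative log-likelihood of `labels` under CTC (illustrative sketch).

    log_probs: list of T frames, each a list of log-probabilities per symbol.
    labels:    target symbol ids without blanks (e.g. phoneme ids).
    blank:     index of the CTC blank symbol (assumed to be 0 here).
    """
    # Extended target: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)

    # alpha[s]: log-probability of all partial alignments ending in ext[s]
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new_alpha = [-math.inf] * S
        for s in range(S):
            candidates = [alpha[s]]                      # stay on same symbol
            if s >= 1:
                candidates.append(alpha[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])          # skip over a blank
            new_alpha[s] = log_sum_exp(*candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or on the trailing blank
    total = log_sum_exp(alpha[-1], alpha[-2]) if S > 1 else alpha[0]
    return -total


def uniform_frames(num_frames, num_symbols):
    """Toy stand-in for network output: uniform posteriors on every frame."""
    return [[math.log(1.0 / num_symbols)] * num_symbols for _ in range(num_frames)]


# The same one-symbol target is scored against 2 frames and against 3 frames.
loss_two_frames = ctc_neg_log_likelihood(uniform_frames(2, 3), [1])
loss_three_frames = ctc_neg_log_likelihood(uniform_frames(3, 3), [1])
```

With uniform posteriors over three symbols, the two calls evaluate to log 3 and log 4.5 respectively, since 3 of the 9 two-frame paths and 6 of the 27 three-frame paths collapse to the single target symbol. The target sequence never has to be aligned to, or padded to match, the number of frames, which is the property the abstract relies on.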
Pages: 10