Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion

被引:6
|
作者
Kang, Xiao [1 ]
Huang, Hao [1 ,2 ]
Hu, Ying [1 ]
Huang, Zhihua [1 ]
机构
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Xinjiang Prov Key Lab Multilingual Informat Techn, Urumqi, Peoples R China
基金
国家重点研发计划;
关键词
Voice conversion; Zero-shot; VQ-VAE; Connectionist temporal classification; NEURAL-NETWORKS;
D O I
10.1016/j.dsp.2021.103110
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
Vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method in non-parallel zero-shot voice conversion (VC). The reason behind is that VQ-VAE is capable of disentangling the content and the speaker representations from the speech by using a content encoder and a speaker encoder, which is suitable for the VC task that makes the speech of a source speaker sound like the speech of the target speaker without changing the linguistic content. However, the converted speech is not satisfying because it is difficult to disentangle the pure content representations from the acoustic features due to the lack of linguistic supervision for the content encoder. To address this issue, under the framework of VQ-VAE, connectionist temporal classification (CTC) loss is proposed to guide the content encoder to learn pure content representations by using an auxiliary network. Based on the fact that the CTC loss is not affected by the sequence length of the output of the content encoder, adding the linguistic supervision to the content encoder can be much easier. This non-parallel many-to-many voice conversion model is named as CTC-VQ-VAE. VC experiments on the CMU ARCTIC and VCTK corpus are carried out to evaluate the proposed method. Both the objective and the subjective results show that the proposed approach significantly improves the speech quality and speaker similarity of the converted speech, compared with the traditional VQ-VAE method. (C) 2021 Elsevier Inc. All rights reserved.
引用
收藏
页数:10
相关论文
共 50 条
  • [11] A Discriminative Cross-Aligned Variational Autoencoder for Zero-Shot Learning
    Liu, Yang
    Gao, Xinbo
    Han, Jungong
    Shao, Ling
    IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (06) : 3794 - 3805
  • [12] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone
    Casanova, Edresson
    Weber, Julian
    Shulby, Christopher
    Candido Junior, Arnaldo
    Goelge, Eren
    Ponti, Moacir Antonelli
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [13] Towards Improved Zero-shot Voice Conversion with Conditional DSVAE
    Lian, Jiachen
    Zhang, Chunlei
    Anumanchipalli, Gopala Krishna
    Yu, Dong
    INTERSPEECH 2022, 2022, : 2598 - 2602
  • [14] Zero-Shot Unseen Speaker Anonymization via Voice Conversion
    Chang, Hyung-Pil
    Yoo, In-Chul
    Jeong, Changhyeon
    Yook, Dongsuk
    IEEE ACCESS, 2022, 10 : 130190 - 130199
  • [15] Multi-Level Temporal-Channel Speaker Retrieval for Zero-Shot Voice Conversion
    Wang, Zhichao
    Xue, Liumeng
    Kong, Qiuqiang
    Xie, Lei
    Chen, Yuanzhe
    Tian, Qiao
    Wang, Yuping
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 2926 - 2937
  • [16] CRANK: AN OPEN-SOURCE SOFTWARE FOR NONPARALLEL VOICE CONVERSION BASED ON VECTOR-QUANTIZED VARIATIONAL AUTOENCODER
    Kobayashi, Kazuhiro
    Huang, Wen-Chin
    Wu, Yi-Chiao
    Tobing, Patrick Lumban
    Hayashi, Tomoki
    Toda, Tomoki
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5934 - 5938
  • [17] Improved Zero-Shot Voice Conversion Using Explicit Conditioning Signals
    Nercessian, Shahan
    INTERSPEECH 2020, 2020, : 4711 - 4715
  • [18] CLAREL: Classification via retrieval loss for zero-shot learning
    Oreshkin, Boris N.
    Rostamzadeh, Negar
    Pinheiro, Pedro O.
    Pal, Christopher
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 3989 - 3993
  • [19] ZERO-SHOT VOICE CONVERSION WITH ADJUSTED SPEAKER EMBEDDINGS AND SIMPLE ACOUSTIC FEATURES
    Tan, Zhiyuan
    Wei, Jianguo
    Xu, Junhai
    He, Yuqing
    Lu, Wenhuan
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5964 - 5968
  • [20] END-TO-END ZERO-SHOT VOICE CONVERSION USING A DDSP VOCODER
    Nercessian, Shahan
    2021 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2021, : 306 - 310