Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion

Cited by: 6
Authors
Kang, Xiao [1]
Huang, Hao [1, 2]
Hu, Ying [1]
Huang, Zhihua [1]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Xinjiang Prov Key Lab Multilingual Informat Techn, Urumqi, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Voice conversion; Zero-shot; VQ-VAE; Connectionist temporal classification; NEURAL-NETWORKS;
DOI
10.1016/j.dsp.2021.103110
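Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification code
0808; 0809;
Abstract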
The vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method for non-parallel zero-shot voice conversion (VC). The reason is that VQ-VAE can disentangle the content and speaker representations of speech by using a content encoder and a speaker encoder, which suits the VC task of making the speech of a source speaker sound like that of a target speaker without changing the linguistic content. However, the converted speech is unsatisfactory because it is difficult to disentangle pure content representations from the acoustic features when the content encoder lacks linguistic supervision. To address this issue, a connectionist temporal classification (CTC) loss is proposed within the VQ-VAE framework to guide the content encoder, through an auxiliary network, to learn pure content representations. Because the CTC loss is not affected by the sequence length of the content encoder output, adding linguistic supervision to the content encoder becomes much easier. This non-parallel many-to-many voice conversion model is named CTC-VQ-VAE. VC experiments on the CMU ARCTIC and VCTK corpora are carried out to evaluate the proposed method. Both the objective and subjective results show that the proposed approach significantly improves the speech quality and speaker similarity of the converted speech compared with the traditional VQ-VAE method. (C) 2021 Elsevier Inc. All rights reserved.
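The abstract's point that the CTC loss does not depend on the length of the content encoder output can be illustrated with a small sketch. The code below is not the authors' implementation: it is a plain-Python version of the standard CTC forward algorithm, and the function names (`log_sum_exp`, `ctc_neg_log_likelihood`) and the toy uniform frame distributions are assumptions made only for illustration. In the paper's setting, the per-frame log-probabilities would presumably come from the auxiliary network applied to the content encoder output.

```python
import math


def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Return -log p(target | frames) using the CTC forward algorithm.

    log_probs: list of T frames, each a list of log-probabilities per symbol.
    target:    list of label ids without blanks.
    The number of frames T does not have to match len(target).
    """
    # Extended label sequence: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(log_probs)

    # Initialise: a path may start in the first blank or the first label.
    alpha = [-math.inf] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [-math.inf] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]                 # stay in the same state
            if s >= 1:
                candidates.append(alpha[s - 1])     # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = log_sum_exp(*candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # A valid path ends in the last label or the trailing blank.
    if num_states > 1:
        total = log_sum_exp(alpha[-1], alpha[-2])
    else:
        total = alpha[-1]
    return -total


# Toy usage: two symbols (blank=0, label=1) with uniform frame distributions.
uniform_frame = [math.log(0.5), math.log(0.5)]
loss_2_frames = ctc_neg_log_likelihood([uniform_frame] * 2, [1])
loss_4_frames = ctc_neg_log_likelihood([uniform_frame] * 4, [1])
```

The same one-label target is scored against 2 frames and against 4 frames without any explicit alignment, which is the property the abstract relies on. The only constraint is that there must be enough frames to emit the target, including a blank between repeated labels.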
Pages: 10
Related papers
50 in total
  • [21] Towards Unseen Speakers Zero-Shot Voice Conversion with Generative Adversarial Networks
    Lu, Weirui
    Xing, Xiaofen
    Xu, Xiangmin
    Zhang, Weibin
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 854 - 858
  • [22] Zero-shot image classification based on unknown-class semantic constraint autoencoder
    Wang X.-S.
    Zhang C.
    Cheng Y.-H.
    Kongzhi yu Juece/Control and Decision, 2023, 38 (12): : 3499 - 3506
  • [23] A Distance-Constrained Semantic Autoencoder for Zero-Shot Remote Sensing Scene Classification
    Wang, Chen
    Peng, Guohua
    De Baets, Bernard
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2021, 14 : 12545 - 12556
  • [24] MULTI-LABEL ZERO-SHOT AUDIO CLASSIFICATION WITH TEMPORAL ATTENTION
    Dogan, Duygu
    Xie, Huang
    Heittola, Toni
    Virtanen, Tuomas
    2024 18TH INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT, IWAENC 2024, 2024, : 250 - 254
  • [25] Flow-VAE VC: End-to-End Flow Framework with Contrastive Loss for Zero-shot Voice Conversion
    Xu, Le
    Zhong, Rongxiu
    Liu, Ying
    Yang, Huibao
    Zhang, Shilei
    INTERSPEECH 2023, 2023, : 2293 - 2297
  • [26] ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
    Chen, Meiying
    Duan, Zhiyao
    INTERSPEECH 2023, 2023, : 2098 - 2102
  • [27] End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions
    Kang, Wonjune
    Hasegawa-Johnson, Mark
    Roy, Deb
    INTERSPEECH 2023, 2023, : 2303 - 2307
  • [28] StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion
    Wang, Zhichao
    Chen, Yuanzhe
    Wang, Xinsheng
    Xie, Lei
    Wang, Yuping
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 3000 - 3004
  • [29] TRAINING ROBUST ZERO-SHOT VOICE CONVERSION MODELS WITH SELF-SUPERVISED FEATURES
    Trung Dang
    Dung Tran
    Chin, Peter
    Koishida, Kazuhito
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6557 - 6561
  • [30] CA-VC: A Novel Zero-Shot Voice Conversion Method With Channel Attention
    Xiao, Ruitong
    Xing, Xiaofen
    Yang, Jichen
    Xu, Xiangmin
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 800 - 807