StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

Cited by: 64
Authors
Kaneko, Takuhiro [1]
Kameoka, Hirokazu [1]
Tanaka, Kou [1]
Hojo, Nobukatsu [1]
Affiliation
[1] NTT Corporation, NTT Communication Science Laboratories, Tokyo, Japan
Source
INTERSPEECH 2019 | 2019
Keywords
voice conversion (VC); non-parallel VC; multi-domain VC; generative adversarial networks (GANs); StarGAN-VC; neural networks
DOI
10.21437/Interspeech.2019-2236
CLC Number
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Code
100104; 100213
Abstract
Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the lack of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem using only a single generator. However, there is still a gap between real and converted speech. To bridge this gap, we rethink the conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2. In particular, we rethink the conditional methods in two aspects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss that allows all source domain data to be converted into the target domain data. For the latter, we introduce a modulation-based conditional method that can transform the modulation of the acoustic feature in a domain-specific manner. We evaluated our methods on non-parallel multi-speaker VC. An objective evaluation demonstrates that our proposed methods improve speech quality in terms of both global and local structure measures. Furthermore, a subjective evaluation shows that StarGAN-VC2 outperforms StarGAN-VC in terms of naturalness and speaker similarity.
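For concreteness, the source-and-target conditional adversarial loss described above can be sketched as follows; this is a reconstruction in our own notation, not a verbatim equation from the paper. The discriminator D is conditioned on both the source code c' and the target code c, so real speech x of domain c is treated as if converted from an arbitrary source c', while generated speech G(x, c, c') is paired with its actual source-target codes:

    \mathcal{L}_{\text{st-adv}} =
        \mathbb{E}_{(x, c) \sim P(x, c),\, c' \sim P(c')} \left[ \log D(x, c', c) \right]
      + \mathbb{E}_{(x, c) \sim P(x, c),\, c' \sim P(c')} \left[ \log \left( 1 - D(G(x, c, c'), c, c') \right) \right]

Conditioning on the (source, target) pair rather than on the target alone is what pushes every source domain to be convertible into every target domain during training.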
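The modulation-based conditional method can likewise be illustrated with conditional instance normalization, where a domain code selects the scale and shift parameters that modulate the normalized acoustic features. Below is a minimal PyTorch sketch under that reading; the class name ConditionalInstanceNorm, the per-pair embedding layout, and all dimensions are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class ConditionalInstanceNorm(nn.Module):
        """Instance normalization whose affine parameters are chosen by a domain code.

        Illustrative sketch: one learned (gamma, beta) pair per domain code,
        e.g. per source-target speaker pair in multi-domain VC.
        """

        def __init__(self, num_features: int, num_domains: int):
            super().__init__()
            # Normalize without built-in affine parameters; modulation supplies them.
            self.norm = nn.InstanceNorm1d(num_features, affine=False)
            self.gamma = nn.Embedding(num_domains, num_features)
            self.beta = nn.Embedding(num_domains, num_features)
            nn.init.ones_(self.gamma.weight)   # start as identity modulation
            nn.init.zeros_(self.beta.weight)

        def forward(self, x: torch.Tensor, domain: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, frames) acoustic features; domain: (batch,) int codes.
            h = self.norm(x)
            g = self.gamma(domain).unsqueeze(-1)  # (batch, channels, 1)
            b = self.beta(domain).unsqueeze(-1)
            return g * h + b  # domain-specific scale and shift

    # Usage: 36-channel features, 4 x 4 = 16 hypothetical source-target pairs.
    layer = ConditionalInstanceNorm(num_features=36, num_domains=16)
    feats = torch.randn(8, 36, 128)
    codes = torch.randint(0, 16, (8,))
    out = layer(feats, codes)  # same shape as feats: (8, 36, 128)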
Pages: 679-683
Number of pages: 5