StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

Cited by: 73
Authors
Kaneko, Takuhiro [1]
Kameoka, Hirokazu [1]
Tanaka, Kou [1]
Hojo, Nobukatsu [1]
Affiliations
[1] NTT Corporation, NTT Communication Science Laboratories, Tokyo, Japan
Source
INTERSPEECH 2019 | 2019
Keywords
voice conversion (VC); non-parallel VC; multi-domain VC; generative adversarial networks (GANs); StarGAN-VC; neural networks
DOI
10.21437/Interspeech.2019-2236
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the need to learn multiple mappings and the unavailability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem using only a single generator. However, a gap remains between real and converted speech. To bridge this gap, we rethink the conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2. In particular, we rethink conditional methods in two aspects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss that allows all source-domain data to be converted into the target domain. For the latter, we introduce a modulation-based conditional method that can transform the modulation of the acoustic features in a domain-specific manner. We evaluated our methods on non-parallel multi-speaker VC. An objective evaluation demonstrates that our proposed methods improve speech quality in terms of both global and local structure measures. Furthermore, a subjective evaluation shows that StarGAN-VC2 outperforms StarGAN-VC in terms of naturalness and speaker similarity.
Pages: 679-683
Page count: 5
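
For concreteness, the source-and-target conditional adversarial loss described in the abstract can be sketched as follows. The notation is a reconstruction from the abstract's description, not a quotation of the paper's equations: x denotes an acoustic feature sequence from source domain c, c' a randomly drawn target-domain code, G the single shared generator, and D a discriminator conditioned on the (source, target) domain pair.

\mathcal{L}_{\mathrm{st\text{-}adv}} =
    \mathbb{E}_{(x,c) \sim P(x,c),\, c' \sim P(c')} \big[ \log D(x, c', c) \big]
  + \mathbb{E}_{(x,c) \sim P(x,c),\, c' \sim P(c')} \big[ \log \big( 1 - D(G(x, c, c'), c, c') \big) \big]

Because D is conditioned on the domain pair, real data of domain c is treated as a valid conversion target from any source c', which matches the abstract's claim that all source-domain data becomes convertible into the target domain.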
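The modulation-based conditional method can likewise be illustrated with a short sketch. A natural reading of the abstract is a conditional-instance-normalization-style layer in which features are normalized and then re-scaled and re-shifted with parameters looked up from the domain code. The PyTorch class below is a minimal sketch under that assumption; all names (ConditionalInstanceNorm, num_domains, etc.) are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class ConditionalInstanceNorm(nn.Module):
    """Instance-normalize features, then modulate them per domain."""

    def __init__(self, num_features: int, num_domains: int):
        super().__init__()
        # Normalize each channel of each utterance independently.
        self.norm = nn.InstanceNorm1d(num_features, affine=False)
        # One (gamma, beta) pair per domain: domain-specific modulation.
        self.embed = nn.Embedding(num_domains, num_features * 2)
        # Start as an identity modulation: gamma = 1, beta = 0.
        self.embed.weight.data[:, :num_features].fill_(1.0)
        self.embed.weight.data[:, num_features:].zero_()

    def forward(self, x: torch.Tensor, domain: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features, time); domain: (batch,) integer codes.
        gamma, beta = self.embed(domain).chunk(2, dim=1)
        h = self.norm(x)
        return gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)

# Usage: modulate 36-dim acoustic feature sequences for 4 speaker domains.
layer = ConditionalInstanceNorm(num_features=36, num_domains=4)
feats = torch.randn(8, 36, 128)      # batch of feature sequences
target = torch.randint(0, 4, (8,))   # target-domain codes
out = layer(feats, target)           # same shape, domain-modulated

Conditioning through modulation parameters, rather than by concatenating a one-hot domain vector to the input, lets a single generator reshape its feature statistics per domain, which is the architectural idea the abstract attributes to StarGAN-VC2.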