StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

Cited by: 73
Authors
Kaneko, Takuhiro [1]
Kameoka, Hirokazu [1]
Tanaka, Kou [1]
Hojo, Nobukatsu [1]
Affiliations
[1] NTT Corporation, NTT Communication Science Laboratories, Tokyo, Japan
Source
INTERSPEECH 2019 | 2019
Keywords
voice conversion (VC); non-parallel VC; multi-domain VC; generative adversarial networks (GANs); StarGAN-VC; neural networks
DOI
10.21437/Interspeech.2019-2236
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the need to learn multiple mappings and the unavailability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem using only a single generator. However, a gap remains between real and converted speech. To bridge this gap, we rethink the conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2. In particular, we rethink conditional methods in two aspects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss that allows all source-domain data to be converted into the target domain. For the latter, we introduce a modulation-based conditional method that can transform the modulation of the acoustic features in a domain-specific manner. We evaluated our methods on non-parallel multi-speaker VC. An objective evaluation demonstrates that our proposed methods improve speech quality in terms of both global and local structure measures. Furthermore, a subjective evaluation shows that StarGAN-VC2 outperforms StarGAN-VC in terms of naturalness and speaker similarity.
Pages: 679-683
Page count: 5
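
For concreteness, the source-and-target conditional adversarial loss described in the abstract can be sketched as follows. The notation is a reconstruction from the abstract's description, not a quotation of the paper's equations: x denotes an acoustic feature sequence from source domain c, c' a randomly drawn target-domain code, G the single shared generator, and D a discriminator conditioned on the (source, target) domain pair.

\mathcal{L}_{\mathrm{st\text{-}adv}} =
    \mathbb{E}_{(x,c) \sim P(x,c),\, c' \sim P(c')} \big[ \log D(x, c', c) \big]
  + \mathbb{E}_{(x,c) \sim P(x,c),\, c' \sim P(c')} \big[ \log \big( 1 - D(G(x, c, c'), c, c') \big) \big]

Because D is conditioned on the domain pair, real data of domain c is treated as a valid conversion target from any source c', which matches the abstract's claim that all source-domain data becomes convertible into the target domain.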
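The modulation-based conditional method can likewise be illustrated with a short sketch. A natural reading of the abstract is a conditional-instance-normalization-style layer in which features are normalized and then re-scaled and re-shifted with parameters looked up from the domain code. The PyTorch class below is a minimal sketch under that assumption; all names (ConditionalInstanceNorm, num_domains, etc.) are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class ConditionalInstanceNorm(nn.Module):
    """Instance-normalize features, then modulate them per domain."""

    def __init__(self, num_features: int, num_domains: int):
        super().__init__()
        # Normalize each channel of each utterance independently.
        self.norm = nn.InstanceNorm1d(num_features, affine=False)
        # One (gamma, beta) pair per domain: domain-specific modulation.
        self.embed = nn.Embedding(num_domains, num_features * 2)
        # Start as an identity modulation: gamma = 1, beta = 0.
        self.embed.weight.data[:, :num_features].fill_(1.0)
        self.embed.weight.data[:, num_features:].zero_()

    def forward(self, x: torch.Tensor, domain: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features, time); domain: (batch,) integer codes.
        gamma, beta = self.embed(domain).chunk(2, dim=1)
        h = self.norm(x)
        return gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)

# Usage: modulate 36-dim acoustic feature sequences for 4 speaker domains.
layer = ConditionalInstanceNorm(num_features=36, num_domains=4)
feats = torch.randn(8, 36, 128)      # batch of feature sequences
target = torch.randint(0, 4, (8,))   # target-domain codes
out = layer(feats, target)           # same shape, domain-modulated

Conditioning through modulation parameters, rather than by concatenating a one-hot domain vector to the input, lets a single generator reshape its feature statistics per domain, which is the architectural idea the abstract attributes to StarGAN-VC2.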