CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

Cited by: 37
Authors
Kaneko, Takuhiro [1 ]
Kameoka, Hirokazu [1 ]
Tanaka, Kou [1 ]
Hojo, Nobukatsu [1 ]
Affiliations
[1] NTT Corporation, NTT Communication Science Laboratories, Chiyoda City, Tokyo, Japan
Source
INTERSPEECH 2020 | 2020
Keywords
voice conversion (VC); non-parallel VC; generative adversarial networks (GANs); CycleGAN-VC; mel-spectrogram conversion
DOI
10.21437/Interspeech.2020-2280
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification
100104; 100213;
Abstract
Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speech without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results on this problem and have been widely used as benchmark methods. However, because it has been unclear whether CycleGAN-VC/VC2 are effective for mel-spectrogram conversion, they are typically applied to mel-cepstrum conversion even when comparison methods use the mel-spectrogram as a conversion target. To address this, we examined the applicability of CycleGAN-VC/VC2 to mel-spectrogram conversion. Through initial experiments, we discovered that their direct applications compromised the time-frequency structure that should be preserved during conversion. To remedy this, we propose CycleGAN-VC3, an improvement of CycleGAN-VC2 that incorporates time-frequency adaptive normalization (TFAN). Using TFAN, we can adjust the scale and bias of the converted features while reflecting the time-frequency structure of the source mel-spectrogram. We evaluated CycleGAN-VC3 on inter-gender and intra-gender non-parallel VC. A subjective evaluation of naturalness and similarity showed that, for every VC pair, CycleGAN-VC3 outperforms or is competitive with two variants of CycleGAN-VC2, one applied to the mel-cepstrum and the other to the mel-spectrogram.
Pages: 2017-2021
Page count: 5
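
As a rough illustration of the TFAN idea described in the abstract, the following is a minimal PyTorch sketch of a conditional normalization layer in the SPADE style: intermediate generator features are normalized with a parameter-free instance norm, then modulated with an element-wise scale and bias computed from the source mel-spectrogram, so the source's time-frequency structure can be reinjected. This is not the authors' implementation; the class name TFAN2d, the hidden width, and the kernel sizes are assumptions, and the paper also describes a 1D variant not shown here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TFAN2d(nn.Module):
    # Minimal sketch of time-frequency adaptive normalization (TFAN).
    # Hypothetical hyperparameters; the paper's exact architecture may differ.
    def __init__(self, num_features, cond_channels=1, hidden=128):
        super().__init__()
        # Parameter-free normalization; scale and bias come from the condition.
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)

    def forward(self, h, x_src):
        # h:     (B, C, F', T') intermediate generator features
        # x_src: (B, 1, F, T)   source mel-spectrogram
        x = F.interpolate(x_src, size=h.shape[2:], mode="nearest")
        ctx = self.shared(x)
        gamma = self.to_gamma(ctx)  # element-wise scale per time-frequency bin
        beta = self.to_beta(ctx)    # element-wise bias per time-frequency bin
        return gamma * self.norm(h) + beta

# Example usage (shapes are illustrative):
tfan = TFAN2d(num_features=64)
h = torch.randn(2, 64, 20, 32)   # downsampled generator features
x = torch.randn(2, 1, 80, 128)   # 80-bin source mel-spectrogram
y = tfan(h, x)                   # same shape as h: (2, 64, 20, 32)

Because gamma and beta vary per time-frequency bin and are derived from the source input, the modulation can restore source structure (e.g., harmonic detail) that a plain feature-wise normalization would wash out, which matches the abstract's stated motivation for TFAN.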