AGAIN-VC: A ONE-SHOT VOICE CONVERSION USING ACTIVATION GUIDANCE AND ADAPTIVE INSTANCE NORMALIZATION

Cited by: 78
Authors
Chen, Yen-Hao [1]
Wu, Da-Yi [1]
Wu, Tsung-Han [1]
Lee, Hung-yi [1]
Affiliations
[1] Natl Taiwan Univ, Coll Elect Engn & Comp Sci, Taipei, Taiwan
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Voice conversion; adaptive instance normalization; activation guidance; disentangled representations;
DOI
10.1109/ICASSP39728.2021.9414257
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Voice conversion (VC) has been widely studied in recent years. Many VC systems use disentanglement-based learning techniques to separate the speaker information and the linguistic content information in a speech signal. They then convert the voice by replacing the speaker information with that of the target speaker. To prevent speaker information from leaking into the content embeddings, previous works either reduce the dimension of the content embedding or quantize it, imposing a strong information bottleneck. These mechanisms tend to hurt the synthesis quality. In this work, we propose AGAIN-VC, an innovative VC system using Activation Guidance and Adaptive Instance Normalization. AGAIN-VC is an auto-encoder-based model comprising a single encoder and a decoder. With a proper activation as an information bottleneck on content embeddings, the trade-off between the synthesis quality and the speaker similarity of the converted speech is improved drastically. This one-shot VC system obtains the best performance in both subjective and objective evaluations.
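The abstract names two mechanisms without giving formulas. As a rough illustration only, the sketch below shows adaptive instance normalization in the form defined by Huang and Belongie [7] (normalize each content channel, then re-scale it with the target's channel mean and standard deviation), plus an activation applied to the content features. The sigmoid activation, the `EPS` value, the function names, and the list-of-channels data layout are assumptions made for this example; none of them is taken from this record, and the sketch is not the authors' implementation.

```python
import math
from statistics import fmean, pvariance

EPS = 1e-5  # assumed small constant to avoid division by zero


def channel_stats(features):
    """Return per-channel means and standard deviations over the time axis.

    `features` is a list of channels; each channel is a list of frame values.
    """
    means = [fmean(ch) for ch in features]
    stds = [math.sqrt(pvariance(ch) + EPS) for ch in features]
    return means, stds


def instance_norm(features):
    """Normalize every channel to roughly zero mean and unit variance."""
    means, stds = channel_stats(features)
    return [[(v - m) / s for v in ch]
            for ch, m, s in zip(features, means, stds)]


def adain(content, style):
    """AdaIN: sigma(style) * normalized(content) + mu(style), per channel.

    Only the style's statistics are used, so the two inputs may have
    different numbers of frames (but the same number of channels).
    """
    style_means, style_stds = channel_stats(style)
    normed = instance_norm(content)
    return [[s * v + m for v in ch]
            for ch, m, s in zip(normed, style_means, style_stds)]


def sigmoid_guidance(features):
    """Squash content features into (0, 1).

    Using a sigmoid here is an assumption; the abstract only says
    "a proper activation".
    """
    return [[1.0 / (1.0 + math.exp(-v)) for v in ch] for ch in features]
```

In this sketch, `adain(content, style)` returns features whose per-channel mean and spread match those of `style` while keeping the frame-to-frame pattern of `content`, which is the sense in which speaker statistics can be swapped without touching the content sequence.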
Pages: 5954-5958
Number of pages: 5
Related papers
27 in total
[1]   One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization [J].
Chou, Ju-chieh ;
Lee, Hung-Yi .
INTERSPEECH 2019, 2019, :664-668
[2]  
Chou JC, 2018, INTERSPEECH, P501
[3]  
Goodfellow IJ, 2014, ADV NEUR IN, V27, P2672
[4]  
Hasegawa-Johnson Mark, 2019, ARXIV190505879
[5]   Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks [J].
Hsu, Chin-Cheng ;
Hwang, Hsin-Te ;
Wu, Yi-Chiao ;
Tsao, Yu ;
Wang, Hsin-Min .
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, :3364-3368
[6]  
Huang WC, 2018, 2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), P51, DOI 10.1109/ISCSLP.2018.8706604
[7]   Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization [J].
Huang, Xun ;
Belongie, Serge .
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, :1510-1519
[8]  
Kameoka H, 2018, IEEE W SP LANG TECH, P266, DOI 10.1109/SLT.2018.8639535
[9]  
Kaneko T., 2017, arXiv
[10]   StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion [J].
Kaneko, Takuhiro ;
Kameoka, Hirokazu ;
Tanaka, Kou ;
Hojo, Nobukatsu .
INTERSPEECH 2019, 2019, :679-683