AGAIN-VC: A ONE-SHOT VOICE CONVERSION USING ACTIVATION GUIDANCE AND ADAPTIVE INSTANCE NORMALIZATION

Cited by: 78

Authors:

Chen, Yen-Hao [1]

Wu, Da-Yi [1]

Wu, Tsung-Han [1]

Lee, Hung-yi [1]

Affiliations:

[1] Natl Taiwan Univ, Coll Elect Engn & Comp Sci, Taipei, Taiwan

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

Voice conversion; adaptive instance normalization; activation guidance; disentangled representations;

DOI:

10.1109/ICASSP39728.2021.9414257

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Recently, voice conversion (VC) has been widely studied. Many VC systems use disentanglement-based learning techniques to separate the speaker information and the linguistic content information in a speech signal. They then convert the voice by replacing the speaker information with that of the target speaker. To prevent speaker information from leaking into the content embeddings, previous works either reduce the dimension of the content embedding or quantize it, imposing a strong information bottleneck. These mechanisms tend to hurt the synthesis quality. In this work, we propose AGAIN-VC, an innovative VC system using Activation Guidance and Adaptive Instance Normalization. AGAIN-VC is an auto-encoder-based model comprising a single encoder and a decoder. With a proper activation as an information bottleneck on content embeddings, the trade-off between the synthesis quality and the speaker similarity of the converted speech is improved drastically. This one-shot VC system obtains the best performance in both subjective and objective evaluations.
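For readers unfamiliar with the two mechanisms named in the abstract, the following is a minimal sketch rather than the authors' implementation. The AdaIN step follows the standard formulation from the style-transfer literature (normalize each channel of the content features, then rescale with the target's per-channel mean and standard deviation). The use of a sigmoid as the guiding activation, the function names, and the toy feature values are all assumptions made for illustration; the record itself does not specify them.

```python
import math

def channel_stats(feat, eps=1e-5):
    """Per-channel mean and standard deviation over the time axis.

    `feat` is a list of channels, each a list of frame values.
    """
    means, stds = [], []
    for channel in feat:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        means.append(mean)
        stds.append(math.sqrt(var + eps))
    return means, stds

def instance_norm(feat, eps=1e-5):
    """Remove per-channel statistics (assumed here to carry speaker information)."""
    means, stds = channel_stats(feat, eps)
    return [[(v - m) / s for v in ch] for ch, m, s in zip(feat, means, stds)]

def adain(content_feat, target_feat, eps=1e-5):
    """Normalize the content features, then apply the target's channel statistics."""
    t_means, t_stds = channel_stats(target_feat, eps)
    normed = instance_norm(content_feat, eps)
    return [[v * s + m for v in ch] for ch, m, s in zip(normed, t_means, t_stds)]

def sigmoid_guidance(feat):
    """Hypothetical bottleneck activation: squash content values into (0, 1)."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in ch] for ch in feat]

# Toy features: 2 channels x 3 frames (illustrative values only).
source = [[0.0, 2.0, 4.0], [1.0, 1.5, 2.0]]
target = [[10.0, 12.0, 11.0], [-3.0, -1.0, -2.0]]

content = sigmoid_guidance(instance_norm(source))  # bounded content embedding
converted = adain(source, target)                  # source content, target statistics
```

In this sketch the converted features inherit the target's per-channel mean and (up to the `eps` term) its standard deviation, while the bounded activation limits how much information the content embedding can carry.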


Pages: 5954-5958

Page count: 5
