One-shot Voice Conversion with Global Speaker Embeddings

Cited by: 34
Authors
Lu, Hui [1 ,2 ]
Wu, Zhiyong [1 ,2 ,3 ]
Dai, Dongyang [1 ,2 ]
Li, Runnan [1 ,2 ]
Kang, Shiyin [4 ]
Jia, Jia [1 ,2 ]
Meng, Helen [1 ,3 ]
Affiliations
[1] Tsinghua Univ, Grad Sch Shenzhen, Tsinghua CUHK Joint Res Ctr Media Sci Technol & S, Shenzhen, Peoples R China
[2] Tsinghua Univ, Beijing Natl Res Ctr Informat Sci & Technol BNRist, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Shatin, Hong Kong, Peoples R China
[4] Tencent, Tencent AI Lab, Shenzhen, Peoples R China
Source
INTERSPEECH 2019 | 2019
Funding
National Natural Science Foundation of China;
Keywords
voice conversion; one-shot; global speaker embedding; WaveNet;
DOI
10.21437/Interspeech.2019-2365
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
Building a voice conversion (VC) system for a new target speaker typically requires a large amount of speech data from that speaker. This paper investigates a method to build a VC system for an arbitrary target speaker from a single given utterance, without any adaptation training. Inspired by global style tokens (GSTs), which have recently been shown to be effective in controlling the style of synthetic speech, we propose global speaker embeddings (GSEs) to control the conversion target of the VC system. Speaker-independent phonetic posteriorgrams (PPGs) serve as the local condition input to a conditional WaveNet synthesizer for waveform generation. Meanwhile, spectrograms extracted from the given utterance are fed into a reference encoder; the resulting reference embedding is used as the attention query over the GSEs to produce a speaker embedding, which serves as the global condition input to the WaveNet synthesizer and controls the speaker identity of the generated waveform. In experiments, compared with an adaptation-training-based any-to-any VC system, the proposed GSE-based approach performs equally well or better in both speech naturalness and speaker similarity, while offering noticeably higher flexibility.
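To make the attention step in the abstract concrete, the following is a minimal PyTorch sketch: a learnable bank of speaker-embedding tokens is attended over using the reference-encoder output as the query, and the weighted sum becomes the global condition fed to WaveNet. All names (GSEBank, n_tokens, token_dim, query_dim) and the scaled dot-product formulation are illustrative assumptions, not the authors' released code; the reference encoder and WaveNet synthesizer are stubbed out.

# Minimal sketch of GSE attention, assuming a GST-style token bank.
# Names and dimensions are hypothetical; reference encoder and WaveNet
# are represented by placeholder tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSEBank(nn.Module):
    def __init__(self, n_tokens=10, token_dim=256, query_dim=128):
        super().__init__()
        # Learnable bank of global speaker embeddings (analogous to GST tokens).
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.5)
        self.query_proj = nn.Linear(query_dim, token_dim)

    def forward(self, ref_embedding):
        # ref_embedding: (batch, query_dim), produced by a reference encoder
        # that summarizes the spectrogram of the one given target utterance.
        query = self.query_proj(ref_embedding)           # (batch, token_dim)
        scores = query @ torch.tanh(self.tokens).t()     # (batch, n_tokens)
        scores = scores / self.tokens.size(1) ** 0.5     # scaled dot product
        weights = F.softmax(scores, dim=-1)              # attention over tokens
        # Weighted sum of GSE tokens -> speaker embedding, used as the global
        # condition of the WaveNet synthesizer (PPGs are the local condition).
        return weights @ torch.tanh(self.tokens)         # (batch, token_dim)

# Usage: one utterance from an unseen speaker suffices at conversion time.
gse = GSEBank()
ref = torch.randn(1, 128)       # stand-in for the reference encoder output
speaker_embedding = gse(ref)    # (1, 256), fed to WaveNet as global condition

The design choice mirrors GSTs: because the speaker embedding is always a convex combination of a fixed token bank, an unseen speaker can be represented at inference time from a single utterance, with no adaptation training.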
Pages: 669-673
Page count: 5