CODE-SWITCHED SPEECH SYNTHESIS USING BILINGUAL PHONETIC POSTERIORGRAM WITH ONLY MONOLINGUAL CORPORA

Cited: 0
Authors
Cao, Yuewen [1 ,4 ]
Liu, Songxiang [1 ]
Wu, Xixin [1 ]
Kang, Shiyin [3 ]
Liu, Peng [3 ]
Wu, Zhiyong [1 ,2 ]
Liu, Xunying [1 ]
Su, Dan [3 ]
Yu, Dong [3 ]
Meng, Helen [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Human Comp Commun Lab, Hong Kong, Peoples R China
[2] Tsinghua Univ, Tsinghua CUHK Joint Res Ctr Media Sci Technol & S, Grad Sch Shenzhen, Shenzhen, Peoples R China
[3] Tencent, Tencent AI Lab, Shenzhen, Peoples R China
[4] Tencent, Shenzhen, Peoples R China
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Funding
National Natural Science Foundation of China;
Keywords
code-switching; speech synthesis; phonetic posteriorgrams;
DOI
10.1109/ICASSP40776.2020.9053094
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Synthesizing fluent code-switched (CS) speech with a consistent voice using only monolingual corpora remains a challenging task, since language alternation seldom occurs in the training data and speaker identity is directly correlated with language. In this paper, we present a bilingual phonetic posteriorgram (PPG) based CS speech synthesizer trained using only monolingual corpora. The bilingual PPG, formed by stacking two monolingual PPGs extracted from two monolingual speaker-independent speech recognition systems, serves as a bridge across speakers and languages. It is assumed that the bilingual PPG represents the articulation of speech sounds in a speaker-independent manner and captures accurate phonetic information of both languages in the same feature space. The proposed model first extracts bilingual PPGs from the training data. An encoder-decoder based model then learns the relationship between input text and bilingual PPGs, and the bilingual PPGs are mapped to acoustic features using a bidirectional long short-term memory (BLSTM) based model conditioned on a speaker embedding to control speaker identity. Experiments validate the effectiveness of the proposed model in terms of speech intelligibility, audio fidelity, and speaker consistency of the generated code-switched speech.
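The core bilingual-PPG construction described in the abstract, stacking two monolingual PPGs frame by frame so that the phonetic classes of both languages live in one feature space, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the dimensionalities (218 Mandarin classes, 131 English classes) and function name are assumptions made here for the example.

```python
import numpy as np

def stack_bilingual_ppg(ppg_zh: np.ndarray, ppg_en: np.ndarray) -> np.ndarray:
    """Form a bilingual PPG by concatenating two monolingual PPGs
    along the phonetic-class axis, frame by frame.

    ppg_zh: (T, D_zh) posteriors from a Mandarin speaker-independent ASR model
    ppg_en: (T, D_en) posteriors from an English speaker-independent ASR model
    Returns a (T, D_zh + D_en) bilingual PPG for the same T frames.
    """
    assert ppg_zh.shape[0] == ppg_en.shape[0], "frame counts must match"
    return np.concatenate([ppg_zh, ppg_en], axis=-1)

# Toy example: 100 frames; class counts are illustrative, not from the paper.
T = 100
ppg_zh = np.random.dirichlet(np.ones(218), size=T)  # each row sums to 1
ppg_en = np.random.dirichlet(np.ones(131), size=T)
bppg = stack_bilingual_ppg(ppg_zh, ppg_en)
print(bppg.shape)  # (100, 349)
```

In the full system this stacked representation would then be predicted from text by the encoder-decoder model and consumed by the BLSTM acoustic model conditioned on a speaker embedding.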
Pages: 7619-7623
Number of pages: 5