Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis

Cited by: 0

Authors:

Wang, Mu [1]

Wu, Zhiyong [1,2]

Wu, Xixin [2]

Meng, Helen [1,2]

Kang, Shiyin [3]

Jia, Jia [1]

Cai, Lianhong [1]

Affiliations:

[1] Tsinghua University, Shenzhen, People's Republic of China

[2] Chinese University of Hong Kong, Hong Kong, People's Republic of China

[3] Tencent AI Lab, Shenzhen, People's Republic of China

Source:

2018 FIRST ASIAN CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII ASIA) | 2018

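Funding:

National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program);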

Keywords:

end-to-end; expressive speech; multi-speaker speech synthesis; transfer learning; emphatic speech;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

End-to-end text-to-speech (E2E TTS) synthesis has achieved great success. This work investigates emphatic speech synthesis and control mechanisms in the E2E framework and proposes an E2E-based method for transferring emphasis characteristics between speakers. Characteristic differences between emphatic and neutral speech are learned from a small-scale corpus of parallel neutral and emphatic utterances recorded by one speaker, and are then transferred to another speaker so that emphatic speech can be generated in the latter speaker's voice. An emphasis embedding is injected into the encoder of the extended E2E TTS model to capture these differences, while the decoder and attention module decode them into synthetic neutral or emphatic speech. Speaker codes linked to the decoder and attention module give the E2E model the ability to transfer characteristics between speakers. To control the emphatic strength, an encoder memory manipulation mechanism is proposed. Experimental results indicate the effectiveness of the proposed model.
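The abstract names two mechanisms without giving their details: an emphasis embedding injected into the encoder, and a manipulation of the encoder memory to control emphatic strength. The sketch below is only one plausible reading, not the authors' implementation. It assumes that the embedding is added to each token's encoder state and that strength is controlled by linearly interpolating (or extrapolating) between the neutral and emphatic encoder memories. The function names `inject_emphasis` and `scale_emphasis`, the additive injection, and the linear rule are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Memory = List[Vector]  # one vector per input token


def inject_emphasis(states: Memory, flags: List[int],
                    embeddings: Dict[int, Vector]) -> Memory:
    """Add a per-token emphasis embedding to each encoder state.

    Assumption: injection is a simple element-wise addition; the
    record does not say how the embedding is actually combined.
    flags[i] is 1 for an emphasized token and 0 for a neutral one.
    """
    return [
        [s + e for s, e in zip(state, embeddings[flag])]
        for state, flag in zip(states, flags)
    ]


def scale_emphasis(neutral: Memory, emphatic: Memory,
                   strength: float) -> Memory:
    """Blend neutral and emphatic encoder memories.

    Assumption: linear rule n + strength * (e - n). strength = 0 gives
    the neutral memory, 1 the emphatic memory, and values above 1
    exaggerate the difference.
    """
    return [
        [n + strength * (e - n) for n, e in zip(n_row, e_row)]
        for n_row, e_row in zip(neutral, emphatic)
    ]
```

Under these assumptions, a decoder conditioned on a speaker code would attend over the blended memory, so the same strength value could be applied regardless of the target speaker.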


Pages: 6
