Cross-Lingual Neural Network Speech Synthesis Based on Multiple Embeddings

Cited by: 4
Authors
Nosek, Tijana V. [1]
Suzic, Sinisa B. [1]
Pekar, Darko J. [2]
Obradovic, Radovan J. [2]
Secujski, Milan S. [1]
Delic, Vlado D. [1]
Affiliations
[1] Univ Novi Sad, Fac Tech Sci, Novi Sad, Serbia
[2] AlfaNum Speech Technol Ltd, Novi Sad, Serbia
Keywords
Cross-lingual; Neural Networks; Speech Synthesis; Vocoder
DOI
10.9781/ijimai.2021.11.005
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The paper presents a novel architecture and method for speech synthesis in multiple languages, in the voices of multiple speakers and in multiple speaking styles, even when speech from a particular speaker in the target language was not present in the training data. The method applies neural network embedding to combinations of speaker and style IDs, and also to phones in particular phonetic contexts, without any prior linguistic knowledge of their phonetic properties. This enables the network not only to efficiently capture similarities and differences between speakers and speaking styles, but also to establish appropriate relationships between phones belonging to different languages, and ultimately to produce synthetic speech in the voice of a given speaker in a language that he/she has never spoken. The validity of the proposed approach has been confirmed through experiments with models trained on speech corpora of American English and Mexican Spanish. It has also been shown that the proposed approach supports the use of neural vocoders, i.e. that they are able to produce synthesized speech of good quality even in languages that they were not trained on.
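The embedding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the inventories, the embedding sizes, and the function name build_inputs are all hypothetical, the tables are random rather than learned, and phonetic context is left out for brevity. It only shows how one embedding per speaker-style combination and one per language-tagged phone could be looked up and joined into per-phone input vectors, so that a Spanish phone sequence can be paired with an English speaker.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inventories (assumptions, not taken from the paper).
SPEAKER_STYLES = [("en_spk1", "neutral"), ("en_spk1", "happy"), ("es_spk1", "neutral")]
PHONES = ["en:a", "en:t", "es:a", "es:t"]  # phones tagged by language, no shared phonetic features

SPK_DIM, PHONE_DIM = 4, 3  # illustrative embedding sizes

spk_style_index = {pair: i for i, pair in enumerate(SPEAKER_STYLES)}
phone_index = {p: i for i, p in enumerate(PHONES)}

# In a trained system these tables would be learned jointly with the network;
# here they are random placeholders.
spk_style_table = rng.normal(size=(len(SPEAKER_STYLES), SPK_DIM))
phone_table = rng.normal(size=(len(PHONES), PHONE_DIM))


def build_inputs(phones, speaker, style):
    """Return one input row per phone: [phone embedding | speaker-style embedding]."""
    spk_vec = spk_style_table[spk_style_index[(speaker, style)]]
    phone_vecs = phone_table[[phone_index[p] for p in phones]]
    tiled = np.tile(spk_vec, (len(phones), 1))  # same speaker-style vector for every phone
    return np.concatenate([phone_vecs, tiled], axis=1)


# Cross-lingual case: Spanish phones combined with an English speaker's embedding.
features = build_inputs(["es:t", "es:a"], "en_spk1", "neutral")
```

Because the speaker-style vector is independent of the phone inventory, any speaker-style combination can in principle be paired with phones of either language, which is the property the abstract relies on.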
Pages: 110-120
Number of pages: 11