End-to-End Text-To-Speech synthesis for under resourced South African languages

Cited by: 1

Authors:

Nthite, Thapelo [1]

Tsoeu, Mohohlo [1]

Affiliations:

[1] Univ Cape Town, Dept Elect Engn, Cape Town, South Africa

Source:

2020 INTERNATIONAL SAUPEC/ROBMECH/PRASA CONFERENCE | 2020

Funding:

National Research Foundation, Singapore;

Keywords:

Speech synthesis; end-to-end; TTS; Sesotho; IsiXhosa;

DOI:

10.1109/saupec/robmech/prasa48453.2020.9041030

Chinese Library Classification (CLC):

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Text-To-Speech (TTS) systems have been widely adopted around the world for applications such as reading to the blind and producing speech for dialog systems. There is, however, a lack of TTS systems for South African languages due to limited resources. End-to-end TTS methods using deep learning have recently been proposed. These methods eliminate the need for time-aligned TTS corpora, making them attractive for under-resourced languages. In this paper we present the first reported use of an end-to-end approach for implementing TTS systems for isiXhosa and Sesotho. We train the model using the Lwazi II Sotho corpus as well as the Lwazi III Xhosa corpus. The performance of the system is compared to the Qfrency TTS system using a mean opinion score (MOS) and a word error rate. The results show that the end-to-end system outperforms the Qfrency TTS system by MOS margins of 0.68 for intelligibility and 0.83 for naturalness. This is one of the first reported implementations of an end-to-end TTS system for any South African language.
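The abstract names two evaluation measures, a mean opinion score and a word error rate, without saying how either was computed. The following is a minimal sketch of the conventional definitions, not the authors' evaluation code: word error rate is taken as the word-level edit distance divided by the reference length, and the opinion score as the plain average of listener ratings. The function names, the 1-5 rating scale, and the placeholder strings are assumptions made for illustration.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_opinion_score(ratings: list[int]) -> float:
    """Average listener rating (a 1-5 scale is assumed here)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ratings)


# Placeholder tokens, not data from the paper.
wer = word_error_rate("a b c d", "a x c")   # one substitution, one deletion
mos = mean_opinion_score([4, 5, 3])
```

In a listening test of this kind, the hypothesis string would typically be a listener's transcription of the synthesized audio, and the comparison between two systems is the difference of their per-system averages.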

Citation

Pages: 684-689

Number of pages: 6

