Testing the Limits of Representation Mixing for Pronunciation Correction in End-to-End Speech Synthesis

Cited by: 2

Authors:

Fong, Jason [1]

Taylor, Jason [1]

King, Simon [1]

Affiliations:

[1] University of Edinburgh, Centre for Speech Technology Research, Edinburgh, Midlothian, Scotland

Source:

INTERSPEECH 2020 | 2020

Keywords:

speech synthesis; representation mixing; pronunciation control

DOI:

10.21437/Interspeech.2020-2618

Chinese Library Classification (CLC) codes:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

Accurate pronunciation is an essential requirement for text-to-speech (TTS) systems. Systems trained on raw text exhibit pronunciation errors in output speech due to ambiguous letter-to-sound relations. Without an intermediate phonemic representation, it is difficult to intervene and correct these errors. Retaining explicit control over pronunciation runs counter to the current drive toward end-to-end (E2E) TTS using sequence-to-sequence models. On the one hand, E2E TTS aims to eliminate manual intervention, especially expert skill such as phonemic transcription of words in a lexicon. On the other hand, a system that makes difficult-to-correct pronunciation errors is of little practical use, so some intervention is necessary. We explore the minimum amount of linguistic features required to correct pronunciation errors in an otherwise E2E TTS system that accepts graphemic input. We use representation mixing: within each sequence, the system accepts graphemic input, phonemic input, or a mixture of both. We quantify how little training data needs to be phonemically labelled (that is, how small a lexicon must be written) to ensure control over pronunciation. We find that modest correction is possible with 500 phonemised word types from the LJ Speech dataset, but correction works best when the majority of word types are phonemised with syllable boundaries.
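The representation-mixing idea described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the toy lexicon, the phoneme symbols, the "." syllable-boundary marker, and the function name `mix_representations` are all hypothetical, chosen only to show how a single input sequence might combine graphemic and phonemic words.

```python
# Minimal sketch of word-level representation mixing (illustrative only).
# Assumption: a tiny hand-written lexicon maps a word type to a phoneme
# string, with "." marking syllable boundaries. The transcriptions below
# are placeholders, not entries from any real lexicon.
TOY_LEXICON = {
    "read": "r eh d",
    "speech": "s p iy ch",
    "synthesis": "s ih n . th ah . s ih s",
}


def mix_representations(text, lexicon, phonemised_words):
    """Return one (kind, symbols) pair per word in `text`.

    A word is emitted as phonemes only if it is both selected in
    `phonemised_words` and present in `lexicon`; every other word
    falls back to its graphemes (characters).
    """
    tokens = []
    for word in text.lower().split():
        if word in phonemised_words and word in lexicon:
            tokens.append(("phonemes", lexicon[word].split()))
        else:
            tokens.append(("graphemes", list(word)))
    return tokens


# Only "synthesis" is forced to its phonemic form; the rest stay graphemic.
mixed = mix_representations("Speech synthesis is hard", TOY_LEXICON, {"synthesis"})
for kind, symbols in mixed:
    print(kind, symbols)
```

In this sketch, correcting a mispronounced word at synthesis time would amount to adding that word to `phonemised_words`, while all other words remain plain text.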


Pages: 4019-4023

Page count: 5

References (15 in total; 10 shown):

[1] Clark, Robert A. J.; Richmond, Korin; King, Simon. Multisyn: Open-domain unit selection for the Festival speech synthesis system [J]. Speech Communication, 2007, 49 (04): 317-330

[2] CMU, 2020, CARN MELL PRON DICT

[3] Ebden, Peter; Sproat, Richard. The Kestrel TTS text normalization system [J]. Natural Language Engineering, 2015, 21 (03): 333-353

[4] Fatchord, 2019, TAC WAVERNN IMPL

[5] Fitt S., 2020, UNISYN LEXICON

[6] Fitt S., 2006, P INTERSPEECH, P1202

[7] Hayashi T, 2020, INT CONF ACOUST SPEE, P7654, DOI 10.1109/ICASSP40776.2020.9053512

[8] Ito Keith, The LJ speech dataset

[9] Kalchbrenner N., 2018, PMLR, P2410

[10] Kastner K, 2019, INT CONF ACOUST SPEE, P5906, DOI 10.1109/ICASSP.2019.8682880
