End-to-end text-to-speech synthesis with unaligned multiple language units based on attention

Cited by: 2

Authors:

Aso, Masashi [1]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliations:

[1] University of Tokyo, Graduate School of Information Science and Technology, Tokyo, Japan

Source:

INTERSPEECH 2020 | 2020

Keywords:

End-to-end; Text-to-speech; Subword; Progressive training; Transformer;

DOI:

10.21437/Interspeech.2020-2347

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

This paper presents the use of unaligned multiple language units for end-to-end text-to-speech (TTS). End-to-end TTS is a promising technology in that it does not require intermediate representation such as prosodic contexts. However, it causes mispronunciation and unnatural prosody. To alleviate this problem, previous methods have used multiple language units, e.g., phonemes and characters, but required the units to be hard-aligned. In this paper, we propose a multi-input attention structure that simultaneously accepts multiple language units without alignments among them. We consider using not only traditional phonemes and characters but also subwords tokenized in a language-independent manner. We also propose a progressive training strategy to deal with the unaligned multiple language units. The experimental results demonstrated that our model and training strategy improve speech quality.
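The idea of attending to several language-unit sequences without aligning them to each other can be illustrated with a small sketch. This is not the authors' architecture: the function names (`attend`, `multi_input_attend`), the use of plain scaled dot-product attention, the reuse of each encoded sequence as both keys and values, and the averaging of the per-unit context vectors are all assumptions made for illustration. The point it shows is only that each unit sequence may have its own length, because the decoder query attends to each one independently.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over one sequence.

    Assumption: the same vectors serve as keys and values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    context = [
        sum(w * vec[d] for w, vec in zip(weights, memory))
        for d in range(len(query))
    ]
    return context, weights


def multi_input_attend(query, memories):
    """Attend to each unit sequence separately, then average the contexts.

    Averaging is a hypothetical combination rule chosen for simplicity.
    """
    contexts, weights = {}, {}
    for name, memory in memories.items():
        contexts[name], weights[name] = attend(query, memory)
    combined = [
        sum(ctx[d] for ctx in contexts.values()) / len(contexts)
        for d in range(len(query))
    ]
    return combined, weights


def random_sequence(length, dim, rng):
    """Stand-in for encoder outputs: `length` random vectors of size `dim`."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
DIM = 4
# Three unit sequences for the same utterance, deliberately of different lengths.
memories = {
    "phoneme": random_sequence(7, DIM, rng),
    "character": random_sequence(5, DIM, rng),
    "subword": random_sequence(3, DIM, rng),
}
query = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]  # one decoder step
combined, weights = multi_input_attend(query, memories)
```

Each entry of `weights` is a distribution over its own sequence (7, 5, and 3 positions here), so no position in one unit sequence has to correspond to a position in another.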

Citation

Pages: 4009-4013

Page count: 5
