End-to-end text-to-speech synthesis with unaligned multiple language units based on attention

Cited by: 2
Authors
Aso, Masashi [1]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
Source
INTERSPEECH 2020 | 2020
Keywords
End-to-end; Text-to-speech; Subword; Progressive training; Transformer
DOI
10.21437/Interspeech.2020-2347
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This paper presents the use of unaligned multiple language units for end-to-end text-to-speech (TTS). End-to-end TTS is a promising technology in that it does not require intermediate representation such as prosodic contexts. However, it causes mispronunciation and unnatural prosody. To alleviate this problem, previous methods have used multiple language units, e.g., phonemes and characters, but required the units to be hard-aligned. In this paper, we propose a multi-input attention structure that simultaneously accepts multiple language units without alignments among them. We consider using not only traditional phonemes and characters but also subwords tokenized in a language-independent manner. We also propose a progressive training strategy to deal with the unaligned multiple language units. The experimental results demonstrated that our model and training strategy improve speech quality.
Pages: 4009-4013
Number of pages: 5