End-to-end text-to-speech synthesis with unaligned multiple language units based on attention

Cited by: 2
Authors
Aso, Masashi [1]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
Source
INTERSPEECH 2020 | 2020
Keywords
End-to-end; Text-to-speech; Subword; Progressive training; Transformer
DOI
10.21437/Interspeech.2020-2347
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
This paper presents the use of unaligned multiple language units for end-to-end text-to-speech (TTS). End-to-end TTS is a promising technology in that it does not require intermediate representations such as prosodic contexts. However, it causes mispronunciation and unnatural prosody. To alleviate this problem, previous methods have used multiple language units, e.g., phonemes and characters, but required the units to be hard-aligned. In this paper, we propose a multi-input attention structure that simultaneously accepts multiple language units without alignments among them. We consider using not only traditional phonemes and characters but also subwords tokenized in a language-independent manner. We also propose a progressive training strategy to deal with the unaligned multiple language units. The experimental results demonstrate that our model and training strategy improve speech quality.
Pages: 4009-4013
Page count: 5