End-to-end text-to-speech synthesis with unaligned multiple language units based on attention

Cited by: 2

Authors:

Aso, Masashi [1]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliations:

[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan

Source:

INTERSPEECH 2020 | 2020

Keywords:

End-to-end; Text-to-speech; Subword; Progressive training; Transformer

DOI:

10.21437/Interspeech.2020-2347

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

This paper presents the use of unaligned multiple language units for end-to-end text-to-speech (TTS). End-to-end TTS is a promising technology in that it does not require intermediate representations such as prosodic contexts. However, it causes mispronunciation and unnatural prosody. To alleviate this problem, previous methods have used multiple language units, e.g., phonemes and characters, but required the units to be hard-aligned. In this paper, we propose a multi-input attention structure that simultaneously accepts multiple language units without alignments among them. We consider using not only traditional phonemes and characters but also subwords tokenized in a language-independent manner. We also propose a progressive training strategy to deal with the unaligned multiple language units. The experimental results demonstrated that our model and training strategy improve speech quality.
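
The idea of attending to several language-unit sequences without aligning them to one another can be illustrated with a small sketch. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the function names `attend` and `multi_input_attend` are hypothetical, the attention is plain scaled dot-product attention, and the per-input contexts are combined by simple averaging, which may differ from the authors' method. The point shown is only that each input sequence (phonemes, characters, subwords) can have its own length, because the decoder query attends to each one separately.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over one sequence."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, vec)) / scale for vec in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, memory))
               for d in range(dim)]
    return context, weights


def multi_input_attend(query, memories):
    """Attend to each language-unit sequence separately, then average.

    Averaging is an assumption of this sketch, not the paper's stated method.
    """
    contexts = {name: attend(query, memory)[0]
                for name, memory in memories.items()}
    dim = len(query)
    combined = [sum(c[d] for c in contexts.values()) / len(contexts)
                for d in range(dim)]
    return combined, contexts


# Toy encoder outputs: three sequences of different lengths, no alignment.
query = [0.1, 0.2, 0.3, 0.4]
memories = {
    "phonemes": [[0.1 * (i + d) for d in range(4)] for i in range(5)],
    "characters": [[0.2 * (i - d) for d in range(4)] for i in range(3)],
    "subwords": [[0.05 * (i * d) for d in range(4)] for i in range(2)],
}
combined, contexts = multi_input_attend(query, memories)
print(len(combined), sorted(contexts))
```

Because each sequence gets its own attention distribution, no hard alignment between phonemes, characters, and subwords is needed in this sketch; only the combined context vector is passed on.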


Pages: 4009-4013

Page count: 5
