End-to-end text-to-speech synthesis with unaligned multiple language units based on attention

Cited by: 2

Authors:

Aso, Masashi [1]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliations:

[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan

Source:

INTERSPEECH 2020 | 2020

Keywords:

End-to-end; Text-to-speech; Subword; Progressive training; Transformer

DOI:

10.21437/Interspeech.2020-2347

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

This paper presents the use of unaligned multiple language units for end-to-end text-to-speech (TTS). End-to-end TTS is a promising technology in that it does not require intermediate representation such as prosodic contexts. However, it causes mispronunciation and unnatural prosody. To alleviate this problem, previous methods have used multiple language units, e.g., phonemes and characters, but required the units to be hard-aligned. In this paper, we propose a multi-input attention structure that simultaneously accepts multiple language units without alignments among them. We consider using not only traditional phonemes and characters but also subwords tokenized in a language-independent manner. We also propose a progressive training strategy to deal with the unaligned multiple language units. The experimental results demonstrated that our model and training strategy improve speech quality.

Pages: 4009-4013

Page count: 5
