ESPNET-TTS: UNIFIED, REPRODUCIBLE, AND INTEGRATABLE OPEN SOURCE END-TO-END TEXT-TO-SPEECH TOOLKIT

Times Cited: 0
Authors
Hayashi, Tomoki [1 ,2 ]
Yamamoto, Ryuichi [3 ]
Inoue, Katsuki [4 ]
Yoshimura, Takenori [1 ,2 ]
Watanabe, Shinji [5 ]
Toda, Tomoki [1 ]
Takeda, Kazuya [1 ]
Zhang, Yu [6 ]
Tan, Xu [7 ]
Affiliations
[1] Nagoya Univ, Nagoya, Aichi, Japan
[2] Human Dataware Lab Co Ltd, Nagoya, Aichi, Japan
[3] LINE Corp, Tokyo, Japan
[4] Okayama Univ, Okayama, Japan
[5] Johns Hopkins Univ, Baltimore, MD 21218 USA
[6] Google AI, Mountain View, CA USA
[7] Microsoft Res, Redmond, WA USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP) | 2020
Keywords
Open-source; end-to-end; text-to-speech
DOI
10.1109/icassp40776.2020.9053512
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, an extension of the open-source speech processing toolkit ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech, and provides recipes inspired by the Kaldi automatic speech recognition (ASR) toolkit. The recipes follow a design unified with the ESPnet ASR recipes, providing high reproducibility. The toolkit also provides pre-trained models and audio samples for all of the recipes so that users can use them as baselines. Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models. This paper describes the design of the toolkit and its experimental evaluation in comparison with other toolkits. The experimental results show that our models achieve state-of-the-art performance comparable to that of the latest toolkits, reaching a mean opinion score (MOS) of 4.25 on the LJSpeech dataset. The toolkit is publicly available at https://github.com/espnet/espnet.
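For readers who want to try the pre-trained models mentioned above, the following is a minimal sketch of synthesizing speech with the toolkit's espnet2 inference API. The Text2Speech class, its from_pretrained helper, and the "kan-bayashi/ljspeech_tacotron2" model tag come from the current repository rather than from this record, so the exact names and behavior here are assumptions, not the procedure described in the paper.

# Minimal sketch: synthesize speech with an ESPnet-TTS pre-trained model.
# Assumes the espnet2 Text2Speech API and the "kan-bayashi/ljspeech_tacotron2"
# model tag; when no neural vocoder is specified, Griffin-Lim is used to
# convert the predicted mel spectrogram to a waveform.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Download and load a pre-trained LJSpeech Tacotron 2 model (assumed tag).
tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_tacotron2")

# Run synthesis; the returned dict includes a "wav" tensor.
output = tts("ESPnet-TTS is an open source end-to-end text-to-speech toolkit.")

# Save the waveform at the model's sampling rate.
sf.write("sample.wav", output["wav"].numpy(), tts.fs)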
Pages: 7654-7658
Number of Pages: 5