ESPNET-TTS: UNIFIED, REPRODUCIBLE, AND INTEGRATABLE OPEN SOURCE END-TO-END TEXT-TO-SPEECH TOOLKIT

Times Cited: 0
Authors
Hayashi, Tomoki [1 ,2 ]
Yamamoto, Ryuichi [3 ]
Inoue, Katsuki [4 ]
Yoshimura, Takenori [1 ,2 ]
Watanabe, Shinji [5 ]
Toda, Tomoki [1 ]
Takeda, Kazuya [1 ]
Zhang, Yu [6 ]
Tan, Xu [7 ]
Affiliations
[1] Nagoya Univ, Nagoya, Aichi, Japan
[2] Human Dataware Lab Co Ltd, Nagoya, Aichi, Japan
[3] LINE Corp, Tokyo, Japan
[4] Okayama Univ, Okayama, Japan
[5] Johns Hopkins Univ, Baltimore, MD 21218 USA
[6] Google AI, Mountain View, CA USA
[7] Microsoft Res, Redmond, WA USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP) | 2020
Keywords
Open-source; end-to-end; text-to-speech
DOI
10.1109/icassp40776.2020.9053512
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, an extension of the open-source speech processing toolkit ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech, and provides recipes inspired by the Kaldi automatic speech recognition (ASR) toolkit. The recipes follow a design unified with the ESPnet ASR recipes, providing high reproducibility. The toolkit also provides pre-trained models and audio samples for all of the recipes so that users can use them as baselines. Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models. This paper describes the design of the toolkit and its experimental evaluation in comparison with other toolkits. The experimental results show that our models achieve state-of-the-art performance comparable to that of the latest toolkits, reaching a mean opinion score (MOS) of 4.25 on the LJSpeech dataset. The toolkit is publicly available at https://github.com/espnet/espnet.
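For readers who want to try the pre-trained models mentioned above, the following is a minimal sketch of synthesizing speech with the toolkit's espnet2 inference API. The Text2Speech class, its from_pretrained helper, and the "kan-bayashi/ljspeech_tacotron2" model tag come from the current repository rather than from this record, so the exact names and behavior here are assumptions, not the procedure described in the paper.

# Minimal sketch: synthesize speech with an ESPnet-TTS pre-trained model.
# Assumes the espnet2 Text2Speech API and the "kan-bayashi/ljspeech_tacotron2"
# model tag; when no neural vocoder is specified, Griffin-Lim is used to
# convert the predicted mel spectrogram to a waveform.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Download and load a pre-trained LJSpeech Tacotron 2 model (assumed tag).
tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_tacotron2")

# Run synthesis; the returned dict includes a "wav" tensor.
output = tts("ESPnet-TTS is an open source end-to-end text-to-speech toolkit.")

# Save the waveform at the model's sampling rate.
sf.write("sample.wav", output["wav"].numpy(), tts.fs)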
Pages: 7654-7658
Number of Pages: 5