E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

Cited by: 0
Authors
Liang, Zheng [1]
Ma, Ziyang [1]
Du, Chenpeng [1]
Yu, Kai [1]
Chen, Xie [1]
Affiliations
[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Dept Comp Sci & Engn, X-LANCE Lab, AI Inst, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Hidden Markov models; Speech recognition; Data augmentation; Acoustics; Context modeling; Speech coding; Predictive models; Decoding; Splicing; Training; Automatic speech recognition; code-switching; named entity recognition; text-based speech editing; text-to-speech; ASR
DOI
10.1109/TASLP.2024.3485466
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Text-based speech editing aims to modify part of a real recording by editing the corresponding transcript, such that the edit is imperceptible to human listeners. As neural text-to-speech (TTS) has grown more capable, researchers have begun to tackle speech editing with TTS methods. In this paper, we propose E3TTS, an end-to-end text-based speech editing TTS system that combines a text encoder, a speech encoder, and a joint net for speech synthesis and speech editing. E3TTS can insert, replace, and delete speech content at will by manipulating the given text. Experiments show that our speech editing outperforms strong baselines on the HiFiTTS and LibriTTS datasets, whose speakers are seen and unseen, respectively. Further, we apply E3TTS to data augmentation for automatic speech recognition (ASR) to mitigate data insufficiency in code-switching and named entity recognition scenarios. Compared with previous data augmentation methods, E3TTS retains the coherence and realism of the recorded audio. The experimental results show significant performance improvements over baseline systems that use traditional TTS-based data augmentation. The code and samples of the proposed speech editing model are available at this repository.
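The abstract names three text-level operations (insert, replace, delete) that drive the edit. As a rough illustration only, and not the authors' method, the sketch below shows how such operations could be recovered by comparing an original transcript with its edited version at the word level using Python's standard `difflib`. The function name `edit_operations` and the whitespace tokenization are assumptions made for this example; the actual system works on speech and its own text representation, which the abstract does not detail.

```python
from difflib import SequenceMatcher


def edit_operations(original: str, edited: str):
    """Return (op, original_words, edited_words) for each changed span.

    Hypothetical helper for illustration: it only compares transcripts
    and does not touch audio. `op` is one of "insert", "replace", "delete".
    """
    src = original.split()  # naive whitespace tokenization (assumption)
    dst = edited.split()
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the spans that actually changed
            ops.append((tag, src[i1:i2], dst[j1:j2]))
    return ops


if __name__ == "__main__":
    print(edit_operations("the cat sat on the mat", "the dog sat on the mat"))
    print(edit_operations("the cat sat", "the black cat sat"))
    print(edit_operations("the black cat sat", "the cat sat"))
```

In a speech editing pipeline, each changed span would then have to be mapped to the corresponding audio region (for example through forced alignment) before any speech is regenerated; that step is outside the scope of this sketch.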
Pages: 4810-4821
Page count: 12