E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

Cited by: 0

Authors:

Liang, Zheng [1]

Ma, Ziyang [1]

Du, Chenpeng [1]

Yu, Kai [1]

Chen, Xie [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Dept Comp Sci & Engn, X-LANCE Lab, AI Inst, Shanghai 200240, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Funding:

National Natural Science Foundation of China

Keywords:

Hidden Markov models; Speech recognition; Data augmentation; Acoustics; Context modeling; Speech coding; Predictive models; Decoding; Splicing; Training; Automatic speech recognition; code-switching; data augmentation; named entity recognition; text-based speech editing; text-to-speech; ASR;

DOI:

10.1109/TASLP.2024.3485466

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Text-based speech editing aims to modify part of a real recording by editing the corresponding transcript, such that the edit is imperceptible to human listeners. With the improved capability of neural text-to-speech (TTS), researchers have tried to tackle speech editing with TTS methods. In this paper, we propose E3TTS, an end-to-end text-based speech editing TTS system that combines a text encoder, a speech encoder, and a joint net for both speech synthesis and speech editing. E3TTS can insert, replace, and delete speech content at will by manipulating the given text. Experiments show that our speech editing outperforms strong baselines on the HiFiTTS and LibriTTS datasets, whose speakers are seen and unseen, respectively. Further, we apply E3TTS to data augmentation for automatic speech recognition (ASR) to mitigate data insufficiency in code-switching and named entity recognition scenarios. Compared to previous data augmentation methods, E3TTS retains the coherence and realism of the recorded audio. The experimental results show significant performance improvements over baseline systems with traditional TTS-based data augmentation. The code and samples of the proposed speech editing model are available in an accompanying repository.

Citation

Pages: 4810-4821

Page count: 12
