E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

Cited by: 0

Authors:

Liang, Zheng [1]

Ma, Ziyang [1]

Du, Chenpeng [1]

Yu, Kai [1]

Chen, Xie [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Dept Comp Sci & Engn, X-LANCE Lab, AI Inst, Shanghai 200240, Peoples R China

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Funding:

National Natural Science Foundation of China;

Keywords:

Hidden Markov models; Speech recognition; Data augmentation; Acoustics; Context modeling; Speech coding; Predictive models; Decoding; Splicing; Training; Automatic speech recognition; code-switching; data augmentation; named entity recognition; text-based speech editing; text-to-speech; ASR;

DOI:

10.1109/TASLP.2024.3485466

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206; 082403;

Abstract:

Text-based speech editing aims to modify part of a real recording by editing its transcribed text, such that the edit is not discernible to the human auditory system. With the enhanced capability of neural text-to-speech (TTS), researchers have tried to tackle speech editing problems with TTS methods. In this paper, we propose E3TTS, an end-to-end text-based speech editing TTS system, which combines a text encoder, a speech encoder, and a joint net for speech synthesis and speech editing. E3TTS can insert, replace, and delete speech content at will by manipulating the given text. Experiments show that our speech editing outperforms strong baselines on the HiFiTTS and LibriTTS datasets, whose speakers are seen and unseen, respectively. Further, we introduce E3TTS into data augmentation for automatic speech recognition (ASR) to mitigate the data insufficiency problem in code-switching and named entity recognition scenarios. E3TTS retains the coherence and realism of the recorded audio compared to past data augmentation methods. The experimental results show significant performance improvements over baseline systems with traditional TTS-based data augmentation. The code and samples of the proposed speech editing model are available at this repository.

Pages: 4810-4821

Page count: 12
