BERTIVITS: The Posterior Encoder Fusion of Pre-Trained Models and Residual Skip Connections for End-to-End Speech Synthesis

Citations: 0
Authors
Wang, Zirui [1 ]
Song, Minqi [1 ]
Zhou, Dongbo [1 ]
Affiliations
[1] Central China Normal University, Faculty of Artificial Intelligence in Education, Wuhan 430079, People's Republic of China
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 12
Keywords
pre-trained model; text to speech; neural TTS; speech synthesis; end-to-end model
DOI
10.3390/app14125060
Chinese Library Classification
O6 [Chemistry]
Subject Classification Code
0703
Abstract
Enhancing the naturalness and rhythm of generated audio is crucial in end-to-end speech synthesis. The current state-of-the-art (SOTA) model, VITS, uses a conditional variational autoencoder architecture, but its robustness is limited because it is trained solely on the text and spectrogram data in the training set. In particular, its posterior encoder struggles to extract mid- and high-frequency spectral features, which degrades waveform reconstruction. Existing work focuses mainly on enhancing the prior encoder or the alignment algorithm and neglects improvements to spectral feature extraction. In response, we propose BERTIVITS, a novel model that integrates BERT into VITS. Our model features a redesigned posterior encoder with residual connections and leverages pre-trained models to enhance spectral feature extraction. Compared to VITS, BERTIVITS achieves significant subjective MOS improvements (0.16 in English, 0.36 in Chinese) and objective Mel-cepstral distortion reductions (0.52 in English, 0.49 in Chinese). BERTIVITS targets single-speaker scenarios and improves speech synthesis technology for applications such as after-class tutoring and telephone customer service.
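As a reader's aid, the following is a minimal PyTorch sketch of the mechanism the abstract describes: a VITS-style posterior encoder whose convolutional blocks are wrapped in residual skip connections and whose spectrogram branch is fused with frame-aligned features from a frozen pre-trained (BERT-style) encoder. This is not the authors' implementation; the class name ResidualPosteriorEncoder, all layer sizes, and the project-and-add fusion scheme are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of a posterior encoder with
# residual skip connections and pre-trained feature fusion.
import torch
import torch.nn as nn

class ResidualPosteriorEncoder(nn.Module):
    # Hypothetical sizes: 513 linear-spectrogram bins, 192 hidden channels,
    # 768-dim features from a frozen BERT-style encoder.
    def __init__(self, spec_channels=513, hidden=192, pretrained_dim=768, n_layers=4):
        super().__init__()
        self.pre = nn.Conv1d(spec_channels, hidden, kernel_size=1)
        # Project frame-aligned pre-trained features into the hidden space
        # so they can be fused with the spectrogram branch by addition.
        self.fuse = nn.Conv1d(pretrained_dim, hidden, kernel_size=1)
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
                nn.ReLU(),
            )
            for _ in range(n_layers)
        )
        # Heads for the mean and log-variance of the posterior q(z|x).
        self.proj = nn.Conv1d(hidden, 2 * hidden, kernel_size=1)

    def forward(self, spec, pretrained_feats):
        # spec: (B, spec_channels, T); pretrained_feats: (B, pretrained_dim, T)
        h = self.pre(spec) + self.fuse(pretrained_feats)
        for layer in self.layers:
            h = h + layer(h)  # residual skip connection around each block
        mean, logvar = self.proj(h).chunk(2, dim=1)
        # Reparameterization trick: z = mean + sigma * eps
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        return z, mean, logvar

# Smoke test with random tensors.
enc = ResidualPosteriorEncoder()
z, mean, logvar = enc(torch.randn(2, 513, 100), torch.randn(2, 768, 100))
print(z.shape)  # torch.Size([2, 192, 100])
```

In the actual VITS architecture the convolutional stack is a WaveNet-style module and the latent z feeds a normalizing flow and a HiFi-GAN-style decoder; the sketch keeps only the residual-connection and feature-fusion idea highlighted in the abstract.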
Pages: 14
Related Papers
40 items in total; items [21]-[30] shown below
  • [21] Qian, Zelin; Jiang, Kun; Zhou, Weitao; Wen, Junze; Jing, Cheng; Cao, Zhong; Yang, Diange. An End-to-End Autonomous Driving Pre-trained Transformer Model for Multi-Behavior-Optimal Trajectory Generation. 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), 2023: 4730-4737.
  • [22] Dalsasso, Emanuele; Yang, Xiangli; Denis, Loic; Tupin, Florence; Yang, Wen. SAR Image Despeckling by Deep Neural Networks: From a Pre-Trained Model to an End-to-End Training Strategy. Remote Sensing, 2020, 12(16).
  • [23] Lee, Ju-Hyung; Lee, Dong-Ho; Sheen, Eunsoo; Choi, Thomas; Pujara, Jay. Seq2Seq-SC: End-to-End Semantic Communication Systems with Pre-trained Language Model. Fifty-Seventh Asilomar Conference on Signals, Systems & Computers, 2023: 260-264.
  • [24] Morais, Edmilson; Kuo, Hong-Kwang J.; Thomas, Samuel; Tuske, Zoltan; Kingsbury, Brian. End-to-End Spoken Language Understanding Using Transformer Networks and Self-Supervised Pre-trained Features. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 7483-7487.
  • [25] Sarvalingam, Parameswaran; Vajravelu, Ashok; Selvam, Janani; Bin Ponniran, Asmarashid; Zaki, Wan Suhaimizan Bin Wan; Kathambari, P. A Novel End-to-End Learning Framework Based on Optimised Residual Gated Units and Stacked Pre-trained Layers for Detection of Autism Spectrum Disorders (ASD). BRAIN - Broad Research in Artificial Intelligence and Neuroscience, 2025, 16(01): 286-305.
  • [26] Lohrenz, Timo; Li, Zhengyang; Fingscheidt, Tim. Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition. Interspeech 2021, 2021: 2846-2850.
  • [27] Lugosch, Loren; Meyer, Brett H.; Nowrouzezahrai, Derek; Ravanelli, Mirco. Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020: 8499-8503.
  • [28] Inoue, Katsuki; Hara, Sunao; Abe, Masanobu; Hayashi, Tomoki; Yamamoto, Ryuichi; Watanabe, Shinji. Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020: 7634-7638.
  • [29] Yang, Bing; Zhong, Jiaqi; Liu, Shan. Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis. Interspeech 2019, 2019: 4480-4484.
  • [30] Masumura, Ryo; Tanaka, Tomohiro; Moriya, Takafumi; Shinohara, Yusuke; Oba, Takanobu; Aono, Yushi. Large Context End-to-End Automatic Speech Recognition via Extension of Hierarchical Recurrent Encoder-Decoder Models. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019: 5661-5665.