BERTIVITS: The Posterior Encoder Fusion of Pre-Trained Models and Residual Skip Connections for End-to-End Speech Synthesis

Citations: 0
Authors
Wang, Zirui [1 ]
Song, Minqi [1 ]
Zhou, Dongbo [1 ]
Affiliations
[1] Central China Normal University, Faculty of Artificial Intelligence in Education, Wuhan 430079, People's Republic of China
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 12
Keywords
pre-trained model; text to speech; neural TTS; speech synthesis; end-to-end model
DOI
10.3390/app14125060
Chinese Library Classification
O6 [Chemistry]
Subject Classification Code
0703
Abstract
Enhancing the naturalness and rhythm of generated audio is crucial in end-to-end speech synthesis. The current state-of-the-art (SOTA) model, VITS, uses a conditional variational autoencoder architecture, but its robustness is limited because it is trained solely on the text and spectrogram data in the training set. In particular, its posterior encoder struggles to extract mid- and high-frequency spectral features, which degrades waveform reconstruction. Existing work focuses mainly on enhancing the prior encoder or the alignment algorithm and neglects improvements to spectral feature extraction. In response, we propose BERTIVITS, a novel model that integrates BERT into VITS. Our model features a redesigned posterior encoder with residual connections and leverages pre-trained models to enhance spectral feature extraction. Compared to VITS, BERTIVITS achieves significant subjective MOS improvements (0.16 in English, 0.36 in Chinese) and objective Mel-cepstral distortion reductions (0.52 in English, 0.49 in Chinese). BERTIVITS targets single-speaker scenarios and improves speech synthesis technology for applications such as after-class tutoring and telephone customer service.
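As a reader's aid, the following is a minimal PyTorch sketch of the mechanism the abstract describes: a VITS-style posterior encoder whose convolutional blocks are wrapped in residual skip connections and whose spectrogram branch is fused with frame-aligned features from a frozen pre-trained (BERT-style) encoder. This is not the authors' implementation; the class name ResidualPosteriorEncoder, all layer sizes, and the project-and-add fusion scheme are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of a posterior encoder with
# residual skip connections and pre-trained feature fusion.
import torch
import torch.nn as nn

class ResidualPosteriorEncoder(nn.Module):
    # Hypothetical sizes: 513 linear-spectrogram bins, 192 hidden channels,
    # 768-dim features from a frozen BERT-style encoder.
    def __init__(self, spec_channels=513, hidden=192, pretrained_dim=768, n_layers=4):
        super().__init__()
        self.pre = nn.Conv1d(spec_channels, hidden, kernel_size=1)
        # Project frame-aligned pre-trained features into the hidden space
        # so they can be fused with the spectrogram branch by addition.
        self.fuse = nn.Conv1d(pretrained_dim, hidden, kernel_size=1)
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
                nn.ReLU(),
            )
            for _ in range(n_layers)
        )
        # Heads for the mean and log-variance of the posterior q(z|x).
        self.proj = nn.Conv1d(hidden, 2 * hidden, kernel_size=1)

    def forward(self, spec, pretrained_feats):
        # spec: (B, spec_channels, T); pretrained_feats: (B, pretrained_dim, T)
        h = self.pre(spec) + self.fuse(pretrained_feats)
        for layer in self.layers:
            h = h + layer(h)  # residual skip connection around each block
        mean, logvar = self.proj(h).chunk(2, dim=1)
        # Reparameterization trick: z = mean + sigma * eps
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        return z, mean, logvar

# Smoke test with random tensors.
enc = ResidualPosteriorEncoder()
z, mean, logvar = enc(torch.randn(2, 513, 100), torch.randn(2, 768, 100))
print(z.shape)  # torch.Size([2, 192, 100])
```

In the actual VITS architecture the convolutional stack is a WaveNet-style module and the latent z feeds a normalizing flow and a HiFi-GAN-style decoder; the sketch keeps only the residual-connection and feature-fusion idea highlighted in the abstract.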
Pages: 14
Related Papers
40 items in total; items [21]-[30] shown below
  • [21] Qian, Zelin; Jiang, Kun; Zhou, Weitao; Wen, Junze; Jing, Cheng; Cao, Zhong; Yang, Diange. An End-to-End Autonomous Driving Pre-trained Transformer Model for Multi-Behavior-Optimal Trajectory Generation. 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), 2023: 4730-4737.
  • [22] Dalsasso, Emanuele; Yang, Xiangli; Denis, Loic; Tupin, Florence; Yang, Wen. SAR Image Despeckling by Deep Neural Networks: From a Pre-Trained Model to an End-to-End Training Strategy. Remote Sensing, 2020, 12(16).
  • [23] Lee, Ju-Hyung; Lee, Dong-Ho; Sheen, Eunsoo; Choi, Thomas; Pujara, Jay. Seq2Seq-SC: End-to-End Semantic Communication Systems with Pre-trained Language Model. Fifty-Seventh Asilomar Conference on Signals, Systems & Computers, 2023: 260-264.
  • [24] Morais, Edmilson; Kuo, Hong-Kwang J.; Thomas, Samuel; Tuske, Zoltan; Kingsbury, Brian. End-to-End Spoken Language Understanding Using Transformer Networks and Self-Supervised Pre-trained Features. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 7483-7487.
  • [25] Sarvalingam, Parameswaran; Vajravelu, Ashok; Selvam, Janani; Bin Ponniran, Asmarashid; Zaki, Wan Suhaimizan Bin Wan; Kathambari, P. A Novel End-to-End Learning Framework Based on Optimised Residual Gated Units and Stacked Pre-trained Layers for Detection of Autism Spectrum Disorders (ASD). BRAIN - Broad Research in Artificial Intelligence and Neuroscience, 2025, 16(01): 286-305.
  • [26] Lohrenz, Timo; Li, Zhengyang; Fingscheidt, Tim. Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition. Interspeech 2021, 2021: 2846-2850.
  • [27] Lugosch, Loren; Meyer, Brett H.; Nowrouzezahrai, Derek; Ravanelli, Mirco. Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020: 8499-8503.
  • [28] Inoue, Katsuki; Hara, Sunao; Abe, Masanobu; Hayashi, Tomoki; Yamamoto, Ryuichi; Watanabe, Shinji. Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020: 7634-7638.
  • [29] Yang, Bing; Zhong, Jiaqi; Liu, Shan. Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis. Interspeech 2019, 2019: 4480-4484.
  • [30] Masumura, Ryo; Tanaka, Tomohiro; Moriya, Takafumi; Shinohara, Yusuke; Oba, Takanobu; Aono, Yushi. Large Context End-to-End Automatic Speech Recognition via Extension of Hierarchical Recurrent Encoder-Decoder Models. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019: 5661-5665.