High-Quality Text-to-Speech Implementation via Active Shallow Diffusion Mechanism

Authors
Deng, Junlin [1]
Hou, Ruihan [1]
Deng, Yan [2]
Long, Yongqiu [2]
Wu, Ning [1]
Affiliations
[1] Beibu Gulf Univ, Key Lab Beibu Gulf Offshore Engn Equipment & Techn, Qinzhou 535011, Peoples R China
[2] Guangxi Univ, Sch Comp & Elect & Informat, Nanning 530004, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
text-to-speech; speech synthesis; diffusion probabilistic model; MixGAN; mel-spectrogram
DOI
10.3390/s25030833
Chinese Library Classification (CLC) number
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
Denoising diffusion probabilistic models (DDPMs) have proven useful in text-to-speech (TTS) tasks; however, traditional diffusion models struggle with real-time processing because iterative sampling requires hundreds of steps. In this work, the Cascaded MixGAN-TTS (CMG-TTS), a two-stage diffusion-based acoustic model for TTS with fast and efficient inference, is proposed to address this problem. An active shallow diffusion mechanism divides the CMG-TTS training process into two stages. In the first stage, a basic acoustic model is trained to provide a priori knowledge for the second stage; for this underlying acoustic modeling, a linguistic encoder based on a mixture combination mechanism is introduced to work with pitch and energy predictors. In the second stage, a post-net is used to optimize mel-spectrogram reconstruction. CMG-TTS is evaluated on datasets including AISHELL3 and LJSpeech, and the experiments show that it achieves satisfactory results on both subjective and objective evaluation metrics with only one denoising step. Compared with other diffusion-based TTS models, CMG-TTS obtains a leading real-time factor (RTF), and ablation studies show that both stages of CMG-TTS are effective.
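The real-time factor (RTF) named in the abstract is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 indicate faster-than-real-time synthesis. The sketch below illustrates that conventional definition only; it is not the authors' evaluation code, and the function names, the `synthesize` callable, and the 22050 Hz sample rate are assumptions introduced for illustration.

```python
import time
from typing import Callable, Sequence


def rtf_from_timings(elapsed_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical TTS callable
    text: str,
    sample_rate: int,
) -> float:
    """Time one call to `synthesize` and convert the timing to an RTF value."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return rtf_from_timings(elapsed, len(waveform), sample_rate)


def dummy_synthesize(text: str) -> Sequence[float]:
    """Stand-in for a real model: returns one second of silence at 22050 Hz."""
    return [0.0] * 22050


if __name__ == "__main__":
    print(measure_rtf(dummy_synthesize, "hello", sample_rate=22050))
```

In practice such a measurement would be averaged over many utterances and reported together with the hardware used, since RTF depends on both.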
Pages: 13