A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Cited by: 0
Authors
Hu, Guoqiang [1]
Tan, Huaning [1]
Li, Ruilai [1]
Affiliations
[1] Jinan Univ, Int Sch, Guangzhou, Peoples R China
Source
2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024 | 2024
Keywords
Mel spectrogram; Speech Synthesis; Fine Grainedness; Continuous Wavelet Transform;
DOI
10.1109/IALP63756.2024.10661192
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Acoustic features play an important role in improving the quality of synthesised speech. Currently, the Mel spectrogram is the acoustic feature employed by most acoustic models. However, because its Fourier transform process loses fine-grained detail, the clarity of speech synthesised from the Mel spectrogram is compromised in mutant signals. To obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). This paradigm introduces an additional task: a more detailed wavelet spectrogram, which, like the post-processing network, takes as input the Mel spectrogram output by the decoder. We choose Tacotron2 and FastSpeech2 for experimental validation in order to test autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. The experimental results show that speech synthesised by the models with the Mel spectrogram enhancement paradigm achieves a higher mean opinion score (MOS), with improvements of 0.14 and 0.09 over the respective baseline models. Because the paradigm succeeds in two different architectures, these findings provide some validation of its universality.
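To illustrate the transform the abstract refers to, the following is a minimal sketch of a CWT scalogram computed by direct correlation with a complex Morlet wavelet. It is not the authors' implementation: the record does not state the wavelet family, the scales, or how the wavelet spectrogram is used as a training target, so the Morlet wavelet, `w0 = 6`, the 50-400 Hz scale grid, and the test signal (a 50 Hz tone with one abrupt click) are all assumptions chosen for illustration. The code uses only the standard library to stay dependency-free.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet wavelet at dimensionless time t (assumed wavelet choice)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt(signal, scales, dt, w0=6.0):
    """Return rows[i][b] = W(scales[i], b * dt), computed by direct correlation."""
    n = len(signal)
    rows = []
    for s in scales:
        # Truncate the wavelet at +/- 4 scale widths.
        half = math.ceil(4 * s / dt)
        # Conjugated, scaled wavelet sampled at offsets -half..half.
        psi_conj = [
            morlet(k * dt / s, w0).conjugate() / math.sqrt(s)
            for k in range(-half, half + 1)
        ]
        row = []
        for b in range(n):
            acc = 0j
            for k in range(-half, half + 1):
                t = b + k
                if 0 <= t < n:  # samples outside the signal count as zero
                    acc += signal[t] * psi_conj[k + half]
            row.append(acc * dt)
        rows.append(row)
    return rows


# Illustrative input: a 50 Hz tone sampled at 1 kHz with one abrupt click.
fs = 1000.0
dt = 1.0 / fs
n = 256
signal = [math.sin(2 * math.pi * 50 * i * dt) for i in range(n)]
signal[128] += 5.0

# Scales for centre frequencies 50..400 Hz, four per octave (f ~ w0 / (2*pi*s)).
freqs = [50.0 * 2 ** (j / 4) for j in range(13)]
scales = [6.0 / (2 * math.pi * f) for f in freqs]

# Magnitude scalogram: one row per scale, one column per sample.
scalogram = [[abs(v) for v in row] for row in cwt(signal, scales, dt)]
```

In this sketch the small-scale (high-frequency) rows use short wavelets and therefore localise the click sharply in time, while the large-scale rows resolve the steady tone; a fixed-window Fourier analysis applies one time resolution to all frequencies. In practice a library routine such as `pywt.cwt` from PyWavelets would normally replace the hand-written loop.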
Pages: 401-405
Number of pages: 5
Related papers
50 in total
  • [21] PHONE-INFORMED REFINEMENT OF SYNTHESIZED MEL SPECTROGRAM FOR DATA AUGMENTATION IN SPEECH RECOGNITION
    Ueno, Sei
    Kawahara, Tatsuya
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8572 - 8576
  • [22] Polish dance music classification based on mel spectrogram decomposition
    Chwaleba, Kinga
    Wach, Weronika
    ADVANCES IN SCIENCE AND TECHNOLOGY-RESEARCH JOURNAL, 2025, 19 (02)
  • [23] Subband-based Spectrogram Fusion for Speech Enhancement by Combining Mapping and Masking Approaches
    Shi, Hao
    Wang, Longbiao
    Li, Sheng
    Dang, Jianwu
    Kawahara, Tatsuya
    PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 286 - 292
  • [24] A Flow-Based Deep Latent Variable Model for Speech Spectrogram Modeling and Enhancement
    Nugraha, Aditya Arie
    Sekiguchi, Kouhei
    Yoshii, Kazuyoshi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 1104 - 1117
  • [25] MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer
    Moon, Sungwoo
    Kim, Sunghyun
    Choi, Yong-Hoon
    IEEE ACCESS, 2022, 10 : 25455 - 25463
  • [26] Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement
    Zmolikova, Katerina
    Pedersen, Michael Syskind
    Jensen, Jesper
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2024, 5 : 274 - 283
  • [27] MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers
    Li, Hui
    Li, Jiawen
    Liu, Hai
    Liu, Tingting
    Chen, Qiang
    You, Xinge
    SENSORS, 2024, 24 (17)
  • [28] CPA Performance Enhancement based on Spectrogram
    Kim, Min Ku
    Ryoo, Jeong Choon
    Han, Dong-Guk
    Yi, Okyeon
    46TH ANNUAL 2012 IEEE INTERNATIONAL CARNAHAN CONFERENCE ON SECURITY TECHNOLOGY, 2012, : 195 - 200
  • [29] Waveform-Domain Speech Enhancement Using Spectrogram Encoding for Robust Speech Recognition
    Shi, Hao
    Mimura, Masato
    Kawahara, Tatsuya
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3049 - 3060