A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Cited by: 0

Authors:

Hu, Guoqiang [1]

Tan, Huaning [1]

Li, Ruilai [1]

Affiliation:

[1] Jinan Univ, Int Sch, Guangzhou, Peoples R China

Source:

2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024 | 2024

Keywords:

Mel spectrogram; Speech Synthesis; Fine Grainedness; Continuous Wavelet Transform;

DOI:

10.1109/IALP63756.2024.10661192

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Acoustic features play an important role in the quality of synthesised speech. The Mel spectrogram is currently a widely used acoustic feature in most acoustic models. However, because its Fourier transform step loses fine-grained detail, the clarity of speech synthesised from a Mel spectrogram is compromised for abruptly changing signals. To obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). This paradigm introduces an additional task, predicting a more detailed wavelet spectrogram, which, like the post-processing network, takes as input the Mel spectrogram output by the decoder. We choose Tacotron2 and Fastspeech2 for experimental validation in order to test autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. The experimental results show that speech synthesised by models using the enhancement paradigm achieves a higher mean opinion score (MOS), with improvements of 0.14 and 0.09 over the respective baseline models. Because the paradigm succeeds in both architectures, these findings provide some evidence of its generality.
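
The abstract's contrast between Fourier-based features and the CWT can be illustrated with a minimal sketch. This is a generic textbook CWT written with NumPy only, not the authors' implementation: the Morlet wavelet, the parameter `w0 = 6`, the 16 kHz sample rate, the analysis frequencies, and the helper names `morlet` and `cwt_magnitude` are all assumptions chosen for illustration. The toy signal is a steady 200 Hz tone with a single click added, and the highest-frequency CWT row is used to locate that click in time.

```python
import numpy as np

def morlet(t, scale, w0=6.0):
    # Complex Morlet wavelet dilated by `scale` (seconds).
    # w0 = 6 is an assumed, commonly used default, not a value from the paper.
    x = t / scale
    return np.pi ** -0.25 * np.exp(1j * w0 * x - 0.5 * x ** 2) / np.sqrt(scale)

def cwt_magnitude(signal, sample_rate, freqs, w0=6.0):
    # Magnitude of the CWT, shape (len(freqs), len(signal)).
    # Assumes the signal is longer than the longest wavelet kernel.
    rows = []
    for f in freqs:
        scale = w0 / (2.0 * np.pi * f)                  # scale centred near f Hz
        half = int(np.ceil(4.0 * scale * sample_rate))  # +-4 envelope widths
        t = np.arange(-half, half + 1) / sample_rate
        # Correlation with the wavelet, written as a convolution
        # with the conjugated, time-reversed kernel.
        kernel = np.conj(morlet(t, scale, w0))[::-1]
        rows.append(np.abs(np.convolve(signal, kernel, mode="same")) / sample_rate)
    return np.array(rows)

# Toy input (hypothetical): a 200 Hz tone with one abrupt click.
sample_rate = 16000
n = 4000
time = np.arange(n) / sample_rate
click_index = 2000
signal = np.sin(2 * np.pi * 200 * time)
signal[click_index] += 5.0

freqs = np.array([200.0, 1600.0, 3200.0])
scalogram = cwt_magnitude(signal, sample_rate, freqs)

# The short, high-frequency wavelet responds most strongly at the click.
peak_index = int(np.argmax(scalogram[-1]))
print(scalogram.shape, peak_index)
```

Because the wavelet shrinks as the analysis frequency rises, the high-frequency rows have fine time resolution, which is the property the abstract relies on when it describes the wavelet spectrogram as more detailed for abrupt changes. In practice a library routine such as a dedicated wavelet package would likely replace the explicit loop.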


Pages: 401-405

Page count: 5
