A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Cited by: 0

Authors:

Hu, Guoqiang [1]

Tan, Huaning [1]

Li, Ruilai [1]

Affiliations:

[1] Jinan Univ, Int Sch, Guangzhou, Peoples R China

Source:

2024 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, IALP 2024 | 2024

Keywords:

Mel spectrogram; Speech Synthesis; Fine Grainedness; Continuous Wavelet Transform

DOI:

10.1109/IALP63756.2024.10661192

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Acoustic features play an important role in the quality of synthesised speech. The Mel spectrogram is currently a widely used acoustic feature in most acoustic models. However, the Fourier transform used to compute it loses fine-grained detail, which compromises the clarity of the synthesised speech for abruptly changing signals. To obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). The paradigm introduces an additional task, producing a more detailed wavelet spectrogram, which, like the post-processing network, takes as input the Mel spectrogram output by the decoder. We validate the paradigm on Tacotron2 and FastSpeech2, which represent autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. Speech synthesised by the models with the enhancement paradigm achieves a higher MOS, improving on the baseline models by 0.14 and 0.09, respectively. Because the paradigm succeeds in both architectures, these results provide some validation of its universality.
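The abstract's motivation, that a wavelet analysis can preserve detail around abrupt changes, can be illustrated with a minimal sketch. This is not the authors' implementation: the complex Morlet wavelet, the parameter `w0`, the scale values, and the helper name `morlet_cwt` are all assumptions chosen for illustration, and the code works on a raw one-dimensional signal rather than inside a Tacotron2 or FastSpeech2 model. It shows only that CWT coefficients at small scales localise a click within a few samples, while larger scales trade time resolution for frequency resolution.

```python
import numpy as np


def morlet_cwt(signal, scales, w0=6.0):
    """Naive CWT with a complex Morlet wavelet (illustrative, not optimised).

    Returns complex coefficients of shape (len(scales), len(signal)).
    Assumes the signal is longer than the widest wavelet (8 * max(scales) + 1).
    """
    signal = np.asarray(signal, dtype=float)
    out = np.empty((len(scales), signal.size), dtype=complex)
    for i, s in enumerate(scales):
        # +/- 4 scale units covers nearly all of the Gaussian envelope.
        half = int(np.ceil(4 * s))
        t = np.arange(-half, half + 1) / s
        wavelet = np.exp(1j * w0 * t) * np.exp(-0.5 * t**2) / np.sqrt(s)
        # Correlating with the wavelet equals convolving with its
        # time-reversed complex conjugate.
        out[i] = np.convolve(signal, np.conj(wavelet[::-1]), mode="same")
    return out


# A single click at sample 512 stands in for an abrupt change in speech.
click = np.zeros(1024)
click[512] = 1.0

scales = [2, 8, 32]  # hypothetical scales, small to large
coeffs = morlet_cwt(click, scales)

# Number of samples whose magnitude exceeds half of the peak, per scale.
widths = [int((np.abs(c) > 0.5 * np.abs(c).max()).sum()) for c in coeffs]
print(widths)
```

The printed widths grow with the scale, so the smallest scale pins the click to a handful of samples. A short-time Fourier transform with one fixed window length cannot make this per-scale trade-off, which is the limitation the abstract attributes to the Mel spectrogram.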

Pages: 401-405

Page count: 5
