Continuous Wavelet Vocoder-based Decomposition of Parametric Speech Waveform Synthesis

Cited by: 3

Authors:

Al-Radhi, Mohammed Salah [1]

Csapo, Tamas Gabor [1, 2]

Zainko, Csaba [1]

Nemeth, Geza [1]

Affiliations:

[1] Budapest Univ Technol & Econ, Dept Telecommun & Media Informat, Budapest, Hungary

[2] MTA ELTE Lendulet Lingual Articulat Res Grp, Budapest, Hungary

Source:

INTERSPEECH 2021 | 2021

Keywords:

wavelet model; speech synthesis; continuous vocoder; statistical features

DOI:

10.21437/Interspeech.2021-1600

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

To date, various speech technology systems have adopted the vocoder approach, a method for synthesizing speech waveforms that plays a major role in the performance of statistical parametric speech synthesis. However, conventional source-filter systems (e.g., STRAIGHT) and sinusoidal models (e.g., MagPhase) tend to produce over-smoothed spectra, which often result in muffled and buzzy synthesized text-to-speech (TTS) output. WaveNet, one of the best models at closely resembling the human voice, has to generate a waveform in a time-consuming sequential manner using neural networks with an extremely complex structure. It also needs large quantities of voice data before accurate predictions can be obtained. As a new, alternative approach to these issues, we present an updated synthesizer: a simple signal model that is easy to train and generates waveforms easily, using the Continuous Wavelet Transform (CWT) to characterize and decompose speech features. The CWT provides time and frequency resolutions different from those of the short-time Fourier transform. It can also retain the fine spectral envelope and achieve high controllability of a structure closer to human auditory scales. We confirmed through experiments that our speech synthesis system was able to provide natural-sounding synthetic speech and outperformed the state-of-the-art WaveNet vocoder.
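To make the multi-scale idea behind the CWT concrete, the following is a minimal sketch of decomposing a one-dimensional feature track into scale bands. It is not the authors' implementation: the Mexican-hat (Ricker) wavelet, the dyadic scales, the function names `ricker` and `cwt`, and the synthetic test signal are all assumptions chosen for illustration.

```python
import numpy as np


def ricker(num_points, scale):
    """Mexican-hat (Ricker) wavelet, unit L2 norm, width `scale` in samples."""
    t = np.arange(num_points) - (num_points - 1) / 2.0
    x = t / scale
    norm = 2.0 / (np.sqrt(3.0 * scale) * np.pi ** 0.25)
    return norm * (1.0 - x ** 2) * np.exp(-0.5 * x ** 2)


def cwt(signal, scales):
    """Return CWT coefficients with shape (len(scales), len(signal)).

    Each row is the signal convolved with the wavelet dilated to one scale.
    The wavelet is symmetric, so convolution equals correlation here.
    """
    signal = np.asarray(signal, dtype=float)
    coeffs = np.empty((len(scales), signal.size))
    for i, s in enumerate(scales):
        # Support of about +/- 5 scales, capped at the signal length, kept odd
        # so that mode="same" stays centred on each sample.
        n = min(int(10 * s) + 1, signal.size)
        if n % 2 == 0:
            n -= 1
        coeffs[i] = np.convolve(signal, ricker(n, s), mode="same")
    return coeffs


# Hypothetical feature track: a slow trend plus a fast fluctuation.
t = np.arange(512)
track = np.sin(2 * np.pi * t / 64) + 0.5 * np.sin(2 * np.pi * t / 8)
scales = [2, 4, 8, 16, 32]          # assumed dyadic scales, in samples
coeffs = cwt(track, scales)

# Mean magnitude per scale shows where each component concentrates.
energy_per_scale = np.abs(coeffs).mean(axis=1)
```

Small scales respond to the fast fluctuation and large scales to the slow trend, which is the kind of scale-dependent resolution that a fixed-window short-time Fourier transform does not offer.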


Pages: 2212-2216

Page count: 5

References: 30 in total

