Neural Homomorphic Vocoder

Cited by: 21

Authors:

Liu, Zhijun [1]

Chen, Kuan [1]

Yu, Kai [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, SpeechLab, Dept Comp Sci & Engn,AI Inst, Shanghai, Peoples R China

Source:

INTERSPEECH 2020 | 2020

Keywords:

speech synthesis; source-filter model; harmonic-plus-noise model; waveform model; synthesis system

DOI:

10.21437/Interspeech.2020-3188

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Subject classification codes:

100104; 100213

Abstract:

In this paper, we propose the neural homomorphic vocoder (NHV), a source-filter model based neural vocoder framework. NHV synthesizes speech by filtering impulse trains and noise with linear time-varying (LTV) filters. A neural network controls the LTV filters by estimating complex cepstrums of time-varying impulse responses given acoustic features. The proposed framework can be trained with a combination of multi-resolution STFT loss and adversarial loss functions. Due to the use of DSP-based synthesis methods, NHV is highly efficient, fully controllable and interpretable. A vocoder was built under the framework to synthesize speech given log-Mel spectrograms and fundamental frequencies. While the model cost only 15 kFLOPs per sample, the synthesis quality remained comparable to baseline neural vocoders in both copy-synthesis and text-to-speech.
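The signal path described in the abstract (impulse trains and noise passed through linear time-varying filters whose impulse responses come from complex cepstrums) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the frame-wise overlap-add scheme, the hop size, and the FFT-ordered cepstrum layout are all assumptions made for illustration, and the neural network that would predict the cepstrums is omitted.

```python
import numpy as np


def impulse_train(f0, sample_rate):
    """Unit impulses placed where the accumulated phase crosses an integer.

    f0 is a per-sample fundamental frequency in Hz; 0 marks unvoiced samples.
    (Hypothetical helper, not taken from the paper.)
    """
    phase = np.cumsum(f0 / sample_rate)
    wraps = np.floor(phase)
    pulses = np.diff(wraps, prepend=0.0) > 0
    return (pulses & (f0 > 0)).astype(np.float64)


def cepstrum_to_impulse_response(ccep):
    """Map a real complex-cepstrum frame to an impulse response.

    Assumes ccep is in FFT order (negative quefrencies in the upper half),
    so that h = IFFT(exp(FFT(ccep))).
    """
    n_fft = len(ccep)
    log_spectrum = np.fft.rfft(ccep, n=n_fft)
    return np.fft.irfft(np.exp(log_spectrum), n=n_fft)


def ltv_filter(excitation, impulse_responses, hop):
    """Frame-wise time-varying filtering with overlap-add.

    Frame i of the excitation (hop samples) is convolved with
    impulse_responses[i]; the tails are summed into the output.
    """
    n_frames, ir_len = impulse_responses.shape
    out = np.zeros(len(excitation) + ir_len - 1)
    for i in range(n_frames):
        segment = excitation[i * hop:(i + 1) * hop]
        if len(segment) == 0:
            break
        filtered = np.convolve(segment, impulse_responses[i])
        out[i * hop:i * hop + len(filtered)] += filtered
    return out[:len(excitation)]


def synthesize(f0, harmonic_ccep, noise_ccep, sample_rate, hop, seed=0):
    """Sum of a filtered impulse train and filtered Gaussian noise."""
    rng = np.random.default_rng(seed)
    h_harm = np.stack([cepstrum_to_impulse_response(c) for c in harmonic_ccep])
    h_noise = np.stack([cepstrum_to_impulse_response(c) for c in noise_ccep])
    harmonic = ltv_filter(impulse_train(f0, sample_rate), h_harm, hop)
    noise = ltv_filter(rng.standard_normal(len(f0)), h_noise, hop)
    return harmonic + noise
```

In this sketch an all-zero cepstrum frame yields a unit impulse as its impulse response, so the corresponding filter passes the excitation through unchanged; a trained model would instead supply one cepstrum frame per hop for each of the two branches.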

Citation

Pages: 240-244

Page count: 5

