Neural Homomorphic Vocoder

Cited by: 21
Authors
Liu, Zhijun [1]
Chen, Kuan [1]
Yu, Kai [1]
Affiliation
[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, SpeechLab, Dept Comp Sci & Engn, AI Inst, Shanghai, Peoples R China
Source
INTERSPEECH 2020 | 2020
Keywords
speech synthesis; source-filter model; harmonic-plus-noise model; waveform model; synthesis system
DOI
10.21437/Interspeech.2020-3188
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
In this paper, we propose the neural homomorphic vocoder (NHV), a source-filter model based neural vocoder framework. NHV synthesizes speech by filtering impulse trains and noise with linear time-varying (LTV) filters. A neural network controls the LTV filters by estimating complex cepstrums of time-varying impulse responses given acoustic features. The proposed framework can be trained with a combination of multi-resolution STFT loss and adversarial loss functions. Due to the use of DSP-based synthesis methods, NHV is highly efficient, fully controllable and interpretable. A vocoder was built under the framework to synthesize speech given log-Mel spectrograms and fundamental frequencies. While the model cost only 15 kFLOPs per sample, the synthesis quality remained comparable to baseline neural vocoders in both copy-synthesis and text-to-speech.
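The synthesis path described in the abstract (an impulse train and noise passed through linear time-varying filters whose impulse responses come from complex cepstrums) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the frame-wise overlap-add filtering, the hop size, and the random stand-in cepstrums are all assumptions made for illustration, and the neural network that would estimate the cepstrums from acoustic features is omitted.

```python
import numpy as np


def cepstrum_to_impulse_response(cepstrum, n_fft=1024):
    """Homomorphic mapping: h = IDFT(exp(DFT(c))) for a complex cepstrum c."""
    spectrum = np.exp(np.fft.fft(cepstrum, n=n_fft))
    return np.real(np.fft.ifft(spectrum))


def impulse_train(f0, sample_rate):
    """Emit a unit impulse whenever the accumulated phase passes an integer.

    f0 is a per-sample fundamental frequency in Hz; zeros mark unvoiced
    samples and therefore produce no impulses.
    """
    phase = np.cumsum(f0 / sample_rate)
    wraps = np.floor(phase)
    train = np.zeros(len(f0))
    train[1:] = (wraps[1:] > wraps[:-1]).astype(float)
    return train


def ltv_filter(signal, impulse_responses, hop):
    """Frame-wise time-varying FIR filtering with overlap-add (assumed scheme).

    impulse_responses has shape (n_frames, ir_len); signal must contain
    exactly n_frames * hop samples.
    """
    n_frames, ir_len = impulse_responses.shape
    if len(signal) != n_frames * hop:
        raise ValueError("signal length must equal n_frames * hop")
    out = np.zeros(n_frames * hop + ir_len - 1)
    for i in range(n_frames):
        frame = signal[i * hop:(i + 1) * hop]
        segment = np.convolve(frame, impulse_responses[i])
        out[i * hop:i * hop + len(segment)] += segment
    return out[:len(signal)]


if __name__ == "__main__":
    # Hypothetical settings; random cepstrums stand in for network outputs.
    sample_rate, hop, n_frames = 16000, 80, 200
    rng = np.random.default_rng(0)
    f0 = np.full(n_frames * hop, 120.0)
    harmonic_ceps = 0.01 * rng.standard_normal((n_frames, 32))
    noise_ceps = 0.01 * rng.standard_normal((n_frames, 32))
    harmonic_irs = np.stack([cepstrum_to_impulse_response(c) for c in harmonic_ceps])
    noise_irs = np.stack([cepstrum_to_impulse_response(c) for c in noise_ceps])
    harmonic = ltv_filter(impulse_train(f0, sample_rate), harmonic_irs, hop)
    noise = ltv_filter(rng.standard_normal(n_frames * hop), noise_irs, hop)
    speech_like = harmonic + noise
    print(speech_like.shape)
```

An all-zero cepstrum maps to a unit impulse, so filtering with it leaves the excitation unchanged, which makes a convenient sanity check for the mapping.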
Pages: 240-244
Page count: 5
Related papers
30 in total
[1] Binkowski M, 2019, arXiv, DOI arXiv:1909.11646
[2] Engel J. H., 2020, arXiv:2001.04643, P1, DOI 10.48550/arXiv.2001.04643
[3] Goodfellow IJ, 2014, ADV NEUR IN, V27, P2672
[4] Griffin D., 1985, ICASSP 85, V10, P513
[5] GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram [J]. Juvela, Lauri; Bollepalli, Bajibabu; Yamagishi, Junichi; Alku, Paavo. INTERSPEECH 2019, 2019: 694-698
[6] Juvela L, 2019, INT CONF ACOUST SPEE, P6915, DOI 10.1109/ICASSP.2019.8683271
[7] Kalchbrenner N, 2018, PR MACH LEARN RES, V80
[8] Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds [J]. Kawahara, H; Masuda-Katsuse, I; de Cheveigné, A. SPEECH COMMUNICATION, 1999, 27 (3-4): 187-207
[9] Kim S, 2019, arXiv, DOI arXiv:1811.02155
[10] Kumar K., 2019, P NIPS