WAVEFFJORD: FFJORD-BASED VOCODER FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Citations: 0
Authors
Wu, Ning-Qian [1]
Ling, Zhen-Hua [1]
Affiliations
[1] University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, People's Republic of China
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
speech synthesis; vocoder; generative models; ODE; FFJORD;
DOI
10.1109/icassp40776.2020.9053202
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Free-form Jacobian of Reversible Dynamics (FFJORD) is a flow-based invertible generative model defined by ordinary differential equations (ODEs). Inspired by WaveGlow, in this paper we propose WaveFFJORD, a neural vocoder that combines FFJORD and WaveNet to synthesize speech waveforms from acoustic features. WaveFFJORD generates speech waveforms directly with black-box ODE solvers, without the need for autoregressive structures. Our experimental results show that WaveFFJORD achieves a smaller model size, lower memory cost, and better speech quality than WaveGlow. In addition, the ODE framework allows users to trade generation speed against quality by setting the error tolerance of the ODE solver.
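The trade-off described in the abstract's last sentence can be illustrated with a toy example. The sketch below is not WaveFFJORD and does not use the authors' solver: it assumes a scalar test equation dz/dt = -z and a hand-written adaptive Heun-Euler integrator (the name `integrate_adaptive` and all parameters are hypothetical). It only shows the general behaviour of adaptive ODE solvers, where a looser error tolerance needs fewer function evaluations at the cost of accuracy.

```python
import math


def integrate_adaptive(f, z0, t0, t1, tol):
    """Integrate dz/dt = f(t, z) from t0 to t1 with an adaptive Heun-Euler step.

    Hypothetical helper for illustration only. Returns the estimate of z(t1)
    and the number of evaluations of f, a rough proxy for generation cost.
    """
    t, z = t0, z0
    h = (t1 - t0) / 10.0  # initial step size (arbitrary choice)
    n_evals = 0
    while t1 - t > 1e-12:
        h = min(h, t1 - t)  # do not step past the end point
        k1 = f(t, z)
        k2 = f(t + h, z + h * k1)
        n_evals += 2
        z_euler = z + h * k1              # first-order estimate
        z_heun = z + 0.5 * h * (k1 + k2)  # second-order estimate
        err = abs(z_heun - z_euler)       # local error estimate
        if err <= tol:
            # Accept the step and advance with the higher-order estimate.
            t += h
            z = z_heun
        # Rescale the step toward the tolerance (safety factor 0.9, clamped).
        scale = 0.9 * math.sqrt(tol / err) if err > 0 else 2.0
        h *= min(2.0, max(0.2, scale))
    return z, n_evals


def decay(t, z):
    # Toy dynamics standing in for a learned network; exact solution is exp(-t).
    return -z


exact = math.exp(-1.0)
z_loose, n_loose = integrate_adaptive(decay, 1.0, 0.0, 1.0, tol=1e-2)
z_tight, n_tight = integrate_adaptive(decay, 1.0, 0.0, 1.0, tol=1e-6)

print("loose tolerance:", n_loose, "evaluations, error", abs(z_loose - exact))
print("tight tolerance:", n_tight, "evaluations, error", abs(z_tight - exact))
```

In a vocoder built on this idea, each function evaluation would be a forward pass through the network that defines the dynamics, so the evaluation count is what the tolerance setting effectively controls.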
Pages: 7214-7218
Page count: 5