SFNet: A Computationally Efficient Source Filter Model Based Neural Speech Synthesis

Cited by: 6
Authors
Rao, Achuth M. V. [1]
Ghosh, Prasanta Kumar [1]
Affiliations
[1] Indian Inst Sci, Dept Elect Engn, Bangalore 560012, Karnataka, India
Keywords
Neural vocoder; source-filter model; computational complexity; Mel-spectrum; linear prediction; estimator
DOI
10.1109/LSP.2020.3005031
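Chinese Library Classification number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract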
Recently, neural speech synthesizers have achieved a high-quality synthesis for text-to-speech applications, but a real-time synthesis is possible only in the devices which have high memory and allow large computational complexity. In this work, we reduce the complexity of a speech synthesizer by reformulating the source-filter model of speech where the excitation signal is modeled as a sum of two signals. The first signal contains an impulse train that is computed from the pitch sequence. The second signal is modeled as white noise passed through a filter bank with frequency-dependent gains. The parameters of the reformulated source-filter model are predicted using a neural network, referred to as SFNet. The network parameters are learnt by training the network using ℓ1 error between the log Mel-spectrum of the predicted waveform and that of the ground-truth waveform. We demonstrate that there is a significant reduction in the memory and computational complexity compared to the state-of-the-art speaker-independent neural speech synthesizer without any loss of the naturalness of the synthesized speech.
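The two-part excitation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`impulse_train`, `shaped_noise`, `excitation`), the uniform band layout, the time-invariant gains, and the FFT-domain filtering are all assumptions made for illustration. In SFNet the gains are predicted by the network, and the abstract does not specify the filter bank design.

```python
import numpy as np

def impulse_train(f0, hop, sr):
    """Impulse train from a frame-level pitch sequence in Hz (0 = unvoiced)."""
    # Hold each frame's pitch value for `hop` samples.
    f0_samples = np.repeat(np.asarray(f0, dtype=float), hop)
    # Accumulated phase in cycles; an impulse is placed at each integer crossing.
    phase = np.cumsum(f0_samples / sr)
    prev = np.concatenate(([0.0], phase[:-1]))
    return (np.floor(phase) > np.floor(prev)).astype(float)

def shaped_noise(n, band_gains, rng):
    """White noise with one gain per uniform frequency band (assumed layout)."""
    spec = np.fft.rfft(rng.standard_normal(n))
    edges = np.linspace(0, spec.shape[0], len(band_gains) + 1).astype(int)
    for gain, lo, hi in zip(band_gains, edges[:-1], edges[1:]):
        spec[lo:hi] *= gain  # frequency-dependent gain for this band
    return np.fft.irfft(spec, n=n)

def excitation(f0, band_gains, hop=160, sr=16000, seed=0):
    """Sum of the periodic (impulse train) and aperiodic (shaped noise) parts."""
    pulses = impulse_train(f0, hop, sr)
    noise = shaped_noise(pulses.shape[0], band_gains, np.random.default_rng(seed))
    return pulses + noise

# Hypothetical usage: 1 s of voiced speech at 125 Hz, noise weighted toward high bands.
e = excitation([125.0] * 100, band_gains=[0.0, 0.05, 0.2])
```

In a full source-filter synthesizer this excitation would then be passed through a vocal-tract filter; that stage is omitted here because the abstract gives no detail on it.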
Pages: 1170-1174
Page count: 5