Fast Spectrogram Inversion Using Multi-Head Convolutional Neural Networks

Cited by: 62
Authors
Arik, Sercan O. [1]
Jun, Heewoo [1]
Diamos, Gregory [1]
Affiliations
[1] Baidu Silicon Valley Artificial Intelligence Lab, Sunnyvale, CA 94089 USA
Keywords
reconstruction; deep learning; convolutional neural networks; short-time Fourier transform; spectrogram; time-frequency signal processing; speech synthesis
DOI
10.1109/LSP.2018.2880284
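Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract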
We propose the multi-head convolutional neural network (MCNN) for waveform synthesis from spectrograms. Nonlinear interpolation in MCNN is employed with transposed convolution layers in parallel heads. MCNN enables significantly better utilization of modern multi-core processors than commonly used iterative algorithms like Griffin-Lim, and yields very fast (more than 300× real time) runtime. For training of MCNN, we use a large-scale speech recognition dataset and losses defined on waveforms that are related to perceptual audio quality. We demonstrate that MCNN constitutes a very promising approach for high-quality speech synthesis, without any iterative algorithms or autoregression in computations.
Pages: 94-98
Number of pages: 5