Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders

Cited by: 5
Authors
Ai, Yang [1 ]
Ling, Zhen-Hua [1 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Peoples R China
Source
INTERSPEECH 2020 | 2020
Funding
National Key R&D Program of China;
Keywords
neural vocoder; log amplitude spectrum; source-filter; TTS; SPEECH SYNTHESIS; GENERATION;
DOI
10.21437/Interspeech.2020-1046
CLC numbers
R36 (Pathology); R76 (Otorhinolaryngology);
Discipline codes
100104; 100213;
Abstract
In our previous work, we proposed a neural vocoder called HiNet, which recovers speech waveforms by hierarchically predicting amplitude and phase spectra from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from the input acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional one. First, the acoustic features (i.e., F0 and mel-cepstra) pass through a knowledge-driven LAS recovery module to obtain approximate LAS (ALAS). This module is designed based on a combination of STFT analysis and source-filter theory, in which the source part and the filter part are derived from the input F0 and mel-cepstra, respectively. Then, the recovered ALAS are processed by a data-driven LAS refinement module, consisting of multiple trainable convolutional layers, to obtain the final LAS. Experimental results show that the HiNet vocoder using KDD-ASP achieves higher synthetic speech quality than both the HiNet vocoder using the conventional ASP and the WaveRNN vocoder on a text-to-speech (TTS) task.
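The knowledge-driven recovery step described in the abstract combines source-filter theory with STFT analysis: in the log domain, the amplitude spectrum decomposes into the sum of an excitation (source) term derived from F0 and a spectral-envelope (filter) term derived from cepstral coefficients. The sketch below is a minimal illustration of that idea only, not the paper's exact module: it uses an idealized harmonic comb for the source, a plain linear-frequency cepstrum (rather than a mel-cepstrum) for the filter, and hypothetical names and parameter values throughout.

```python
import numpy as np

def approximate_las(f0, cep, fft_size=1024, sr=16000):
    """Knowledge-driven approximation of the log amplitude spectrum
    (ALAS) for a single frame, under source-filter theory:
    log|X(w)| ~= log|source(w)| + log|filter(w)|.

    f0  : fundamental frequency in Hz (0 for unvoiced frames)
    cep : 1-D array of cepstral coefficients (c0, c1, ...)
    """
    n_bins = fft_size // 2 + 1

    # Source part: harmonic peaks at integer multiples of F0 for voiced
    # frames, a flat (noise-like) log spectrum for unvoiced frames.
    if f0 > 0:
        source = np.full(n_bins, -10.0)            # low log-amplitude floor
        harmonics = np.arange(f0, sr / 2, f0)      # harmonic frequencies in Hz
        idx = np.round(harmonics * fft_size / sr).astype(int)
        source[idx] = 0.0                          # unit-amplitude harmonics
    else:
        source = np.zeros(n_bins)

    # Filter part: cepstral coefficients map to a log spectral envelope
    # via a cosine expansion: log|H(w)| = c0 + 2 * sum_k c_k * cos(k*w).
    omega = np.pi * np.arange(n_bins) / (n_bins - 1)
    k = np.arange(1, len(cep))
    envelope = cep[0] + 2.0 * np.cos(np.outer(omega, k)) @ cep[1:]

    # Summing in the log domain corresponds to multiplying the source
    # and filter amplitude spectra, as source-filter theory prescribes.
    return source + envelope
```

In the paper itself this hand-designed ALAS is only a starting point; the trainable convolutional refinement module then corrects the mismatch between the approximation and the true LAS.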
Pages: 190-194
Page count: 5