WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU

Cited by: 5
Authors
Hsu, Po-chun [1,2]
Lee, Hung-yi [1,2]
Affiliations
[1] Natl Taiwan Univ, Coll Elect Engn & Comp Sci, Taipei, Taiwan
[2] Natl Taiwan Univ, Grad Inst Commun Engn, Taipei, Taiwan
Source
INTERSPEECH 2020 | 2020
Keywords
neural vocoder; raw waveform synthesis; text-to-speech
DOI
10.21437/Interspeech.2020-1736
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions in the frequency domain. Because the flow-based model is heavily compressed, the proposed model requires far fewer computational resources than other waveform generation models during both training and inference; even so, the post-filter maintains the quality of the generated waveform. Our PyTorch implementation can be trained using less than 8 GB of GPU memory and generates audio samples at a rate of more than 960 kHz on an NVIDIA 1080Ti GPU. Furthermore, even when synthesizing on a CPU, the proposed method generates 44.1 kHz speech waveforms 1.2 times faster than real-time. Experiments also show that the quality of the generated audio is comparable to that of other methods. Audio samples are publicly available online.
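The abstract describes a joint objective: the flow-based model's likelihood term combined with frequency-domain losses on the generated waveform. A minimal PyTorch sketch of such a combined loss is shown below, assuming a multi-resolution STFT loss (spectral convergence plus log-magnitude distance) as the frequency-domain term; the exact losses, resolutions, and weighting used by WG-WaveNet are not given in this record, so every name and constant here is illustrative.

```python
import torch
import torch.nn.functional as F

def stft_loss(pred, target, n_fft, hop):
    """Spectral convergence + log-magnitude L1 at one STFT resolution."""
    window = torch.hann_window(n_fft)
    P = torch.stft(pred, n_fft, hop_length=hop, window=window,
                   return_complex=True).abs()
    T = torch.stft(target, n_fft, hop_length=hop, window=window,
                   return_complex=True).abs()
    eps = 1e-7
    # Spectral convergence: relative Frobenius distance between magnitudes.
    sc = (T - P).pow(2).sum().sqrt() / (T.pow(2).sum().sqrt() + eps)
    # L1 distance between log magnitudes.
    mag = F.l1_loss(torch.log(P + eps), torch.log(T + eps))
    return sc + mag

def joint_loss(flow_nll, pred_wave, target_wave, lam=1.0,
               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Illustrative joint objective: flow negative log-likelihood plus an
    averaged multi-resolution STFT loss (hypothetical weighting `lam`)."""
    freq = sum(stft_loss(pred_wave, target_wave, n, h) for n, h in resolutions)
    return flow_nll + lam * freq / len(resolutions)
```

In this sketch, `flow_nll` would come from the compact flow-based model and `pred_wave` from the post-filter output, so one backward pass trains both components jointly, matching the joint-training setup the abstract describes.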
Pages: 210-214
Page count: 5