WAVE-TACOTRON: SPECTROGRAM-FREE END-TO-END TEXT-TO-SPEECH SYNTHESIS

Cited by: 51
Authors
Weiss, Ron J. [1 ]
Skerry-Ryan, R. J. [1 ]
Battenberg, Eric [1 ]
Mariooryad, Soroosh [1 ]
Kingma, Diederik P. [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
text-to-speech; audio synthesis; normalizing flow;
DOI
10.1109/ICASSP39728.2021.9413851
CLC number
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within each block are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding blocks. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.
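The generation scheme the abstract describes — non-overlapping fixed-length blocks, each sampled by inverting a normalizing flow conditioned on the preceding block — can be illustrated with a toy sketch. This is not the paper's model: the `condition` map below is a hypothetical stand-in for the learned Tacotron-style decoder, a single affine transform stands in for the full flow, and the block size is shrunk from hundreds of samples to four for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
BLOCK = 4  # samples per block; the paper uses blocks of hundreds of samples


def condition(prev_block):
    # Hypothetical conditioning network: a fixed linear map of the
    # previous block stands in for the learned autoregressive decoder,
    # which would also attend to the text encoding.
    shift = 0.1 * prev_block
    log_scale = np.full(BLOCK, -1.0)
    return shift, log_scale


def flow_inverse(z, shift, log_scale):
    # Invert one affine flow step: x = z * exp(log_scale) + shift.
    # A real flow stacks many such invertible steps; sampling runs them
    # in the inverse direction, training in the forward direction.
    return z * np.exp(log_scale) + shift


def synthesize(n_blocks):
    # Blockwise autoregressive sampling: draw z ~ N(0, I) for each block
    # and invert the flow conditioned on the block just generated.
    # Samples *within* a block are produced in parallel; only the
    # block-to-block loop is sequential.
    prev = np.zeros(BLOCK)
    blocks = []
    for _ in range(n_blocks):
        z = rng.standard_normal(BLOCK)
        shift, log_scale = condition(prev)
        prev = flow_inverse(z, shift, log_scale)
        blocks.append(prev)
    return np.concatenate(blocks)


wav = synthesize(8)  # 8 blocks -> 32 "samples"
```

Because the flow is invertible with a tractable Jacobian, training can evaluate the exact likelihood of each ground-truth block in one parallel pass, which is what lets the model be optimized directly with maximum likelihood and synthesize faster than sample-level autoregressive vocoders.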
Pages: 5679-5683 (5 pages)