Synthesizing waveform sequence-to-sequence to augment training data for sequence-to-sequence speech recognition

Cited by: 1
Authors
Ueno, Sei [1 ]
Mimura, Masato [1 ]
Sakai, Shinsuke [1 ]
Kawahara, Tatsuya [1 ]
Affiliations
[1] Kyoto Univ, Grad Sch Informat, Sakyo Ku, Kyoto 6068501, Japan
Keywords
Speech recognition; Sequence-to-sequence model; Attention-based encoder-decoder model; Speech synthesis; Data augmentation
DOI
10.1250/ast.42.333
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Sequence-to-sequence (seq2seq) automatic speech recognition (ASR) has recently achieved state-of-the-art performance with fast decoding and a simple architecture. On the other hand, it requires a large amount of training data and cannot exploit text-only data for training. In our previous work, we proposed a method for applying text data to seq2seq ASR training by leveraging text-to-speech (TTS). However, we observed that the log Mel-scale filterbank (lmfb) features produced by the Tacotron 2-based model are blurry, particularly along the time dimension. This problem is mitigated by introducing the WaveNet vocoder, which generates speech of better quality, or a spectrogram with better time resolution. This makes it possible to train a waveform-input end-to-end ASR model. Here, we use CNN filters and apply a masking method similar to SpecAugment. We compare the waveform-input model with two kinds of lmfb-input models: (1) lmfb features directly generated by TTS, and (2) lmfb features converted from the waveform generated by TTS. Experimental evaluations show that the combination of waveform-output TTS and the waveform-input end-to-end ASR model outperforms the lmfb-input models in two domain adaptation settings.
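The abstract mentions a masking method similar to SpecAugment applied when training the ASR model on synthesized speech. The snippet below is a minimal sketch of SpecAugment-style time and frequency masking; the function name, mask widths, and mean-fill value are illustrative assumptions and are not taken from the paper, which applies masking in the context of a waveform-input model with CNN filters.

```python
import numpy as np

def spec_augment_like_mask(features, num_time_masks=2, max_time_width=40,
                           num_freq_masks=2, max_freq_width=15, rng=None):
    """Apply SpecAugment-style masking to a (time, frequency) feature matrix.

    Parameter values are illustrative; the paper's exact settings may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = features.astype(float)
    fill = out.mean()                          # fill masked regions with the global mean (assumption)
    T, F = out.shape

    for _ in range(num_time_masks):            # mask a contiguous block of time frames
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(T - w, 1)))
        out[t0:t0 + w, :] = fill

    for _ in range(num_freq_masks):            # mask a contiguous block of frequency bins
        w = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(F - w, 1)))
        out[:, f0:f0 + w] = fill

    return out

# Example: mask lmfb-like features of a TTS-generated utterance (dummy data)
lmfb = np.random.randn(500, 80)                # 500 frames x 80 Mel bins
augmented = spec_augment_like_mask(lmfb)
```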
Pages: 333-343
Number of pages: 11