Synthesizing waveform sequence-to-sequence to augment training data for sequence-to-sequence speech recognition

Cited by: 1
Authors
Ueno, Sei [1 ]
Mimura, Masato [1 ]
Sakai, Shinsuke [1 ]
Kawahara, Tatsuya [1 ]
Affiliations
[1] Kyoto Univ, Grad Sch Informat, Sakyo Ku, Kyoto 6068501, Japan
Keywords
Speech recognition; Sequence-to-sequence model; Attention-based encoder-decoder model; Speech synthesis; Data augmentation
DOI
10.1250/ast.42.333
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Sequence-to-sequence (seq2seq) automatic speech recognition (ASR) has recently achieved state-of-the-art performance with fast decoding and a simple architecture. On the other hand, it requires a large amount of training data and cannot exploit text-only data for training. In our previous work, we proposed a method for applying text data to seq2seq ASR training by leveraging text-to-speech (TTS). However, we observed that the log Mel-scale filterbank (lmfb) features produced by the Tacotron 2-based model are blurry, particularly along the time dimension. This problem is mitigated by introducing the WaveNet vocoder, which generates speech of better quality, or a spectrogram with better time resolution. This makes it possible to train a waveform-input end-to-end ASR model. Here, we use CNN filters and apply a masking method similar to SpecAugment. We compare the waveform-input model with two kinds of lmfb-input models: (1) lmfb features directly generated by TTS, and (2) lmfb features converted from the waveform generated by TTS. Experimental evaluations show that the combination of waveform-output TTS and the waveform-input end-to-end ASR model outperforms the lmfb-input models in two domain adaptation settings.
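The abstract mentions a masking method similar to SpecAugment applied when training the ASR model on synthesized speech. The snippet below is a minimal sketch of SpecAugment-style time and frequency masking; the function name, mask widths, and mean-fill value are illustrative assumptions and are not taken from the paper, which applies masking in the context of a waveform-input model with CNN filters.

```python
import numpy as np

def spec_augment_like_mask(features, num_time_masks=2, max_time_width=40,
                           num_freq_masks=2, max_freq_width=15, rng=None):
    """Apply SpecAugment-style masking to a (time, frequency) feature matrix.

    Parameter values are illustrative; the paper's exact settings may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = features.astype(float)
    fill = out.mean()                          # fill masked regions with the global mean (assumption)
    T, F = out.shape

    for _ in range(num_time_masks):            # mask a contiguous block of time frames
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, max(T - w, 1)))
        out[t0:t0 + w, :] = fill

    for _ in range(num_freq_masks):            # mask a contiguous block of frequency bins
        w = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(F - w, 1)))
        out[:, f0:f0 + w] = fill

    return out

# Example: mask lmfb-like features of a TTS-generated utterance (dummy data)
lmfb = np.random.randn(500, 80)                # 500 frames x 80 Mel bins
augmented = spec_augment_like_mask(lmfb)
```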
Pages: 333-343
Number of pages: 11