SIMPLEFLAT: A SIMPLE WHOLE-NETWORK PRE-TRAINING APPROACH FOR RNN TRANSDUCER-BASED END-TO-END SPEECH RECOGNITION

Cited by: 7
Authors
Moriya, Takafumi [1 ]
Ashihara, Takanori [1 ]
Tanaka, Tomohiro [1 ]
Ochiai, Tsubasa [1 ]
Sato, Hiroshi [1 ]
Ando, Atsushi [1 ]
Ijima, Yusuke [1 ]
Masumura, Ryo [1 ]
Shinohara, Yusuke [1 ]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
speech recognition; neural network; end-to-end; recurrent neural network-transducer; whole-network pre-training
DOI
10.1109/ICASSP39728.2021.9413741
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Recurrent neural network-transducer (RNN-T) is promising for building time-synchronous end-to-end automatic speech recognition (ASR) systems, in part because it does not need frame-wise alignment between input features and target labels in the training step. Although training without alignment is beneficial, it makes it difficult to discern the relation between input features and output token sequences, which in effect degrades RNN-T performance. Our solution is SimpleFlat (SF), a novel and simple whole-network pre-training approach for RNN-T. SF extracts frame-wise alignments on-the-fly from the training dataset and does not require any external resources. We distribute equal numbers of target tokens to each frame, following the RNN-T encoder output lengths, by repeating each token. The frame-wise tokens thus created are shifted and also used as the prediction network inputs. Therefore, SF can be implemented by cross-entropy loss computation as in autoregressive model training. Experiments on Japanese and English ASR tasks demonstrate that SF can effectively improve various RNN-T architectures.
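A minimal sketch of the SF target construction described in the abstract, assuming an encoder output of length T and a target token sequence of length U with U <= T. The function names and the blank/shift-in symbol id are illustrative assumptions, not taken from the paper:

from typing import List

BLANK = 0  # hypothetical id used both as blank and as the shift-in symbol


def make_sf_targets(tokens: List[int], num_frames: int) -> List[int]:
    """Spread U target tokens over T encoder frames by repeating each
    token an (almost) equal number of times, per the abstract."""
    u = len(tokens)
    assert 0 < u <= num_frames, "SF assumes U <= T"
    # frame t is assigned token index floor(t * U / T), so each token
    # covers either floor(T/U) or ceil(T/U) consecutive frames
    return [tokens[t * u // num_frames] for t in range(num_frames)]


def shift_for_prediction_net(frame_targets: List[int]) -> List[int]:
    """Shift the frame-wise targets right by one step so they can be fed
    to the prediction network as inputs, autoregressive-style."""
    return [BLANK] + frame_targets[:-1]


if __name__ == "__main__":
    tokens = [7, 3, 9]  # U = 3 target tokens
    T = 8               # encoder output length
    targets = make_sf_targets(tokens, T)
    inputs = shift_for_prediction_net(targets)
    print(targets)  # [7, 7, 7, 3, 3, 3, 9, 9]
    print(inputs)   # [0, 7, 7, 7, 3, 3, 3, 9]

Given these frame-wise targets and shifted inputs, SF pre-training would reduce to a per-frame cross-entropy loss against the joint-network outputs, as in ordinary autoregressive model training.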
Pages: 5664 - 5668
Number of pages: 5