EXPLORING PRE-TRAINING WITH ALIGNMENTS FOR RNN TRANSDUCER BASED END-TO-END SPEECH RECOGNITION

Times Cited: 0
Authors
Hu, Hu [1 ,2 ]
Zhao, Rui [1 ]
Li, Jinyu [1 ]
Lu, Liang [1 ]
Gong, Yifan [1 ]
Affiliations
[1] Microsoft Speech & Language Grp, Redmond, WA 98052 USA
[2] Georgia Inst Technol, Atlanta, GA 30332 USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
RNN transducer; end-to-end; alignments; speech recognition; pre-training; attention
DOI
10.1109/icassp40776.2020.9054663
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research because it is capable of online streaming speech recognition. However, RNN-T training is made difficult by its huge memory requirements and complicated neural structure. A common solution to ease RNN-T training is to employ a connectionist temporal classification (CTC) model together with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we instead leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on 65,000 hours of Microsoft anonymized production data with personally identifiable information removed, our proposed methods obtain significant improvements. In particular, the encoder pre-training solution achieves a 10% and an 8% relative word error rate reduction compared with random initialization and the widely used CTC+RNNLM initialization strategy, respectively. Our solutions also significantly reduce the RNN-T model latency relative to the baseline.
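To make the encoder pre-training idea concrete, the sketch below is a minimal, assumption-laden PyTorch illustration (not the authors' released code): an LSTM encoder is first trained with a frame-level cross-entropy loss against external alignments, and the converged weights would then seed the RNN-T encoder before training with the RNN-T loss. The model size, feature dimension, vocabulary size, and alignment source are all hypothetical.

```python
import torch
import torch.nn as nn

# Illustrative constants (assumptions, not from the paper's record above).
FEAT_DIM = 80      # e.g. log-mel filterbank features
VOCAB_SIZE = 4000  # e.g. word-piece output units

class Encoder(nn.Module):
    """LSTM encoder later reused to initialize the RNN-T encoder."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=640, layers=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, VOCAB_SIZE)

    def forward(self, x):                 # x: (batch, frames, FEAT_DIM)
        h, _ = self.lstm(x)               # (batch, frames, hidden)
        return self.proj(h)               # frame-level logits

encoder = Encoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Dummy batch standing in for acoustic features and external frame-level
# alignments (one target label per frame, e.g. from a hybrid ASR system).
feats = torch.randn(8, 200, FEAT_DIM)
alignments = torch.randint(0, VOCAB_SIZE, (8, 200))

logits = encoder(feats)
loss = criterion(logits.reshape(-1, VOCAB_SIZE), alignments.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After convergence, encoder.state_dict() would initialize the RNN-T
# encoder, and training continues with the RNN-T loss.
```

A design note under the same assumptions: because the frame-level targets come from an existing alignment, this stage teaches the encoder where labels occur in time before the RNN-T loss is ever applied, which is consistent with the latency reduction the abstract reports.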
Pages: 7079-7083
Page count: 5