DEVELOPING REAL-TIME STREAMING TRANSFORMER TRANSDUCER FOR SPEECH RECOGNITION ON LARGE-SCALE DATASET

Cited by: 101
Authors
Chen, Xie [1 ]
Wu, Yu [2 ]
Wang, Zhenghao [1 ]
Liu, Shujie [2 ]
Li, Jinyu [1 ]
Affiliations
[1] Microsoft Speech & Language Grp, Hangzhou, Peoples R China
[2] Microsoft Res Asia, Hangzhou, Peoples R China
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Transformer; Transducer; Real-time decoding; Speech Recognition;
DOI
10.1109/ICASSP39728.2021.9413535
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Recently, Transformer-based end-to-end models have achieved great success in many areas, including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing its application. In this work, we explored the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and the streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.
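The chunk-wise streaming idea the abstract describes can be illustrated with an attention mask: each frame attends only to frames in its own chunk (bounded look-ahead) plus a limited number of cached left-context chunks, in the spirit of Transformer-XL. The sketch below is an illustration under these assumptions, not the authors' implementation; the function name and the `chunk_size` / `left_chunks` parameters are hypothetical.

```python
import numpy as np

def chunk_attention_mask(num_frames: int, chunk_size: int, left_chunks: int) -> np.ndarray:
    """Boolean mask where mask[t, s] is True if frame t may attend to frame s.

    A frame sees its whole chunk (look-ahead limited to the chunk boundary)
    plus `left_chunks` preceding chunks, mimicking a Transformer-XL-style
    bounded history for streaming inference.
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for t in range(num_frames):
        chunk_idx = t // chunk_size
        start = max(0, (chunk_idx - left_chunks) * chunk_size)   # oldest cached frame
        end = min((chunk_idx + 1) * chunk_size, num_frames)      # end of current chunk
        mask[t, start:end] = True
    return mask
```

Because the look-ahead never exceeds the current chunk, latency is bounded by `chunk_size` frames, while `left_chunks` trades accuracy against runtime cost, matching the paper's observation that a relatively small look-ahead suffices.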
Pages: 5904-5908
Page count: 5