On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

Cited by: 73
Authors
Li, Jinyu [1 ]
Wu, Yu [2 ]
Gaur, Yashesh [1 ]
Wang, Chengyi [2 ]
Zhao, Rui [1 ]
Liu, Shujie [2 ]
Affiliations
[1] Microsoft Speech & Language Group, Redmond, WA 98052 USA
[2] Microsoft Research Asia, Beijing, People's Republic of China
Source
INTERSPEECH 2020
Keywords
end-to-end; RNN-transducer; attention-based encoder-decoder; transformer;
DOI
10.21437/Interspeech.2020-2846
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Classification Codes
100104; 100213;
Abstract
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models in both non-streaming and streaming modes. We train these models on 65 thousand hours of Microsoft anonymized training data. Because E2E models are more data-hungry, it is more meaningful to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in streaming mode if its encoder is properly initialized. Among all three E2E models, Transformer-AED achieves the best accuracy in both streaming and non-streaming modes. We also show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.
Pages: 1-5
Number of pages: 5
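
The abstract above contrasts transducer (RNN-T) models with attention-based encoder-decoder models. As a purely illustrative aside, not taken from the paper, the following PyTorch sketch shows how an RNN-T joint network fuses acoustic-encoder frames with prediction-network states and is scored with a transducer loss; the module structure, toy dimensions, and the use of torchaudio's rnnt_loss function are assumptions made for this example only.

# Illustrative sketch only (not the paper's implementation): a minimal RNN-T
# joint network combining encoder frames with prediction-network states,
# trained with torchaudio's transducer loss. All dimensions are toy values.
import torch
import torch.nn as nn
import torchaudio.functional as F_audio

class Joiner(nn.Module):
    """Combine encoder output (B, T, D) and prediction output (B, U+1, D) into logits."""
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)  # vocabulary includes the blank symbol

    def forward(self, enc_out, pred_out):
        # Broadcast-add over the time (T) and label (U+1) axes, then project to logits.
        joint = self.enc_proj(enc_out).unsqueeze(2) + self.pred_proj(pred_out).unsqueeze(1)
        return self.out(torch.tanh(joint))  # shape (B, T, U+1, vocab)

B, T, U, V = 2, 50, 10, 32                      # batch, frames, target length, vocab size
enc_out = torch.randn(B, T, 256)                # stand-in for an acoustic encoder output
pred_out = torch.randn(B, U + 1, 256)           # stand-in for a prediction-network output
targets = torch.randint(1, V, (B, U), dtype=torch.int32)  # label ids; 0 is reserved for blank

logits = Joiner(256, 256, 512, V)(enc_out, pred_out)
loss = F_audio.rnnt_loss(
    logits,
    targets,
    logit_lengths=torch.full((B,), T, dtype=torch.int32),
    target_lengths=torch.full((B,), U, dtype=torch.int32),
    blank=0,
)
print(loss)
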
Related Papers
50 records in total
  • [41] TRIGGERED ATTENTION FOR END-TO-END SPEECH RECOGNITION
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5666 - 5670
  • [42] An Overview of End-to-End Automatic Speech Recognition
    Wang, Dong
    Wang, Xiaodong
    Lv, Shaohe
    SYMMETRY-BASEL, 2019, 11 (08):
  • [43] A review on speech recognition approaches and challenges for Portuguese: exploring the feasibility of fine-tuning large-scale end-to-end models
    Li, Yan
    Wang, Yapeng
    Hoi, Lap Man
    Yang, Dingcheng
    Im, Sio-Kei
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2025, 2025 (01):
  • [44] KNOWLEDGE TRANSFER FROM LARGE-SCALE PRETRAINED LANGUAGE MODELS TO END-TO-END SPEECH RECOGNIZERS
    Kubo, Yotaro
    Karita, Shigeki
    Bacchiani, Michiel
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8512 - 8516
  • [45] End-to-End Speech Recognition in Agglutinative Languages
    Mamyrbayev, Orken
    Alimhan, Keylan
    Zhumazhanov, Bagashar
    Turdalykyzy, Tolganay
    Gusmanova, Farida
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS (ACIIDS 2020), PT II, 2020, 12034 : 391 - 401
  • [46] End-to-end Korean Digits Speech Recognition
    Roh, Jong-hyuk
    Cho, Kwantae
    Kim, Youngsam
    Cho, Sangrae
    2019 10TH INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGY CONVERGENCE (ICTC): ICT CONVERGENCE LEADING THE AUTONOMOUS FUTURE, 2019, : 1137 - 1139
  • [47] Variable Scale Pruning for Transformer Model Compression in End-to-End Speech Recognition
    Ben Letaifa, Leila
    Rouas, Jean-Luc
    ALGORITHMS, 2023, 16 (09)
  • [49] A large-scale dataset for end-to-end table recognition in the wild
    Yang, Fan
    Hu, Lei
    Liu, Xinwu
    Huang, Shuangping
    Gu, Zhenghui
    SCIENTIFIC DATA, 2023, 10 (01)
  • [50] SFA: Searching faster architectures for end-to-end automatic speech recognition models
    Liu, Yukun
    Li, Ta
    Zhang, Pengyuan
    Yan, Yonghong
    COMPUTER SPEECH AND LANGUAGE, 2023, 81