On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

Cited by: 73
Authors
Li, Jinyu [1 ]
Wu, Yu [2 ]
Gaur, Yashesh [1 ]
Wang, Chengyi [2 ]
Zhao, Rui [1 ]
Liu, Shujie [2 ]
Affiliations
[1] Microsoft Speech & Language Grp, Redmond, WA 98052 USA
[2] Microsoft Res Asia, Beijing, Peoples R China
Source
Keywords
end-to-end; RNN-transducer; attention-based encoder-decoder; transformer;
DOI
10.21437/Interspeech.2020-2846
CLC classification
R36 [pathology]; R76 [otorhinolaryngology];
Discipline codes
100104 ; 100213 ;
Abstract
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: the recurrent neural network transducer (RNN-T), the RNN attention-based encoder-decoder (RNN-AED), and the Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. Because E2E models are more data-hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in the streaming mode if its encoder can be properly initialized. Among all three E2E models, Transformer-AED achieved the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.
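The abstract contrasts RNN-T with attention-based encoder-decoder (AED) models. As background only, the following is a minimal plain-Python sketch (not the paper's implementation) of the single-head dot-product attention step that an AED decoder performs at each output step: score every encoder state against the current decoder query, normalize the scores with a softmax, and return the weighted sum of encoder states as the context vector.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot_product_attention(query, keys, values):
    """One attention step of an AED decoder (illustrative sketch).

    query  : decoder state, list of floats (length d)
    keys   : encoder states, list of length-d lists
    values : encoder states to be summarized, list of lists
    Returns (attention weights, context vector)."""
    # Score each encoder state by its dot product with the query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Normalize scores into a distribution over encoder positions.
    weights = softmax(scores)
    # Context vector = weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    return weights, context

# Tiny worked example: the query aligns with the first encoder state,
# so the context vector leans toward the first value vector.
w, c = dot_product_attention([1.0, 0.0],
                             [[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 2.0], [3.0, 4.0]])
```

In a real AED model the query, keys, and values come from learned projections of decoder and encoder hidden states; streaming variants restrict which encoder positions the softmax may attend to.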
Pages: 1-5
Page count: 5
Related papers
50 records in total
  • [21] END-TO-END AUDIOVISUAL SPEECH RECOGNITION
    Petridis, Stavros
    Stafylakis, Themos
    Ma, Pingchuan
    Cai, Feipeng
    Tzimiropoulos, Georgios
    Pantic, Maja
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6548 - 6552
  • [22] END-TO-END ANCHORED SPEECH RECOGNITION
    Wang, Yiming
    Fan, Xing
    Chen, I-Fan
    Liu, Yuzong
    Chen, Tongfei
    Hoffmeister, Bjorn
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7090 - 7094
  • [23] End-to-end visual speech recognition for small-scale datasets
    Petridis, Stavros
    Wang, Yujiang
    Ma, Pingchuan
    Li, Zuwei
    Pantic, Maja
    PATTERN RECOGNITION LETTERS, 2020, 131 : 421 - 427
  • [24] END-TO-END ATTENTION-BASED LARGE VOCABULARY SPEECH RECOGNITION
    Bahdanau, Dzmitry
    Chorowski, Jan
    Serdyuk, Dmitriy
    Brakel, Philemon
    Bengio, Yoshua
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 4945 - 4949
  • [25] Large Margin Training for Attention Based End-to-End Speech Recognition
    Wang, Peidong
    Cui, Jia
    Weng, Chao
    Yu, Dong
    INTERSPEECH 2019, 2019, : 246 - 250
  • [26] Confidence-based Ensembles of End-to-End Speech Recognition Models
    Gitman, Igor
    Lavrukhin, Vitaly
    Laptev, Aleksandr
    Ginsburg, Boris
    INTERSPEECH 2023, 2023, : 1414 - 1418
  • [27] Residual Energy-Based Models for End-to-End Speech Recognition
    Li, Qiujia
    Zhang, Yu
    Li, Bo
    Cao, Liangliang
    Woodland, Philip C.
    INTERSPEECH 2021, 2021, : 4069 - 4073
  • [28] Do End-to-End Speech Recognition Models Care About Context?
    Borgholt, Lasse
    Havtorn, Jakob D.
    Agic, Zeljko
    Sogaard, Anders
    Maaloe, Lars
    Igel, Christian
    INTERSPEECH 2020, 2020, : 4352 - 4356
  • [29] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [30] SYNCHRONOUS TRANSFORMERS FOR END-TO-END SPEECH RECOGNITION
    Tian, Zhengkun
    Yi, Jiangyan
    Bai, Ye
    Tao, Jianhua
    Zhang, Shuai
    Wen, Zhengqi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7884 - 7888