On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

Cited by: 73
Authors
Li, Jinyu [1 ]
Wu, Yu [2 ]
Gaur, Yashesh [1 ]
Wang, Chengyi [2 ]
Zhao, Rui [1 ]
Liu, Shujie [2 ]
Affiliations
[1] Microsoft Speech & Language Grp, Redmond, WA 98052 USA
[2] Microsoft Res Asia, Beijing, Peoples R China
Source
Keywords
end-to-end; RNN-transducer; attention-based encoder-decoder; transformer;
DOI
10.21437/Interspeech.2020-2846
CLC classification
R36 [pathology]; R76 [otorhinolaryngology];
Discipline codes
100104 ; 100213 ;
Abstract
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: the recurrent neural network transducer (RNN-T), the RNN attention-based encoder-decoder (RNN-AED), and the Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. Because E2E models are more data-hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in the streaming mode if its encoder can be properly initialized. Among all three E2E models, Transformer-AED achieved the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.
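The abstract contrasts RNN-T with attention-based encoder-decoder (AED) models. As background only, the following is a minimal plain-Python sketch (not the paper's implementation) of the single-head dot-product attention step that an AED decoder performs at each output step: score every encoder state against the current decoder query, normalize the scores with a softmax, and return the weighted sum of encoder states as the context vector.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot_product_attention(query, keys, values):
    """One attention step of an AED decoder (illustrative sketch).

    query  : decoder state, list of floats (length d)
    keys   : encoder states, list of length-d lists
    values : encoder states to be summarized, list of lists
    Returns (attention weights, context vector)."""
    # Score each encoder state by its dot product with the query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Normalize scores into a distribution over encoder positions.
    weights = softmax(scores)
    # Context vector = weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    return weights, context

# Tiny worked example: the query aligns with the first encoder state,
# so the context vector leans toward the first value vector.
w, c = dot_product_attention([1.0, 0.0],
                             [[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 2.0], [3.0, 4.0]])
```

In a real AED model the query, keys, and values come from learned projections of decoder and encoder hidden states; streaming variants restrict which encoder positions the softmax may attend to.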
Pages: 1-5
Page count: 5
Related papers
50 records in total
  • [21] END-TO-END AUDIOVISUAL SPEECH RECOGNITION
    Petridis, Stavros
    Stafylakis, Themos
    Ma, Pingchuan
    Cai, Feipeng
    Tzimiropoulos, Georgios
    Pantic, Maja
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6548 - 6552
  • [22] END-TO-END ANCHORED SPEECH RECOGNITION
    Wang, Yiming
    Fan, Xing
    Chen, I-Fan
    Liu, Yuzong
    Chen, Tongfei
    Hoffmeister, Bjorn
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7090 - 7094
  • [23] End-to-end visual speech recognition for small-scale datasets
    Petridis, Stavros
    Wang, Yujiang
    Ma, Pingchuan
    Li, Zuwei
    Pantic, Maja
    PATTERN RECOGNITION LETTERS, 2020, 131 : 421 - 427
  • [24] END-TO-END ATTENTION-BASED LARGE VOCABULARY SPEECH RECOGNITION
    Bahdanau, Dzmitry
    Chorowski, Jan
    Serdyuk, Dmitriy
    Brakel, Philemon
    Bengio, Yoshua
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 4945 - 4949
  • [25] Large Margin Training for Attention Based End-to-End Speech Recognition
    Wang, Peidong
    Cui, Jia
    Weng, Chao
    Yu, Dong
    INTERSPEECH 2019, 2019, : 246 - 250
  • [26] Confidence-based Ensembles of End-to-End Speech Recognition Models
    Gitman, Igor
    Lavrukhin, Vitaly
    Laptev, Aleksandr
    Ginsburg, Boris
    INTERSPEECH 2023, 2023, : 1414 - 1418
  • [27] Residual Energy-Based Models for End-to-End Speech Recognition
    Li, Qiujia
    Zhang, Yu
    Li, Bo
    Cao, Liangliang
    Woodland, Philip C.
    INTERSPEECH 2021, 2021, : 4069 - 4073
  • [28] Do End-to-End Speech Recognition Models Care About Context?
    Borgholt, Lasse
    Havtorn, Jakob D.
    Agic, Zeljko
    Sogaard, Anders
    Maaloe, Lars
    Igel, Christian
    INTERSPEECH 2020, 2020, : 4352 - 4356
  • [29] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [30] SYNCHRONOUS TRANSFORMERS FOR END-TO-END SPEECH RECOGNITION
    Tian, Zhengkun
    Yi, Jiangyan
    Bai, Ye
    Tao, Jianhua
    Zhang, Shuai
    Wen, Zhengqi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7884 - 7888