On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

被引:73
|
作者
Li, Jinyu [1 ]
Wu, Yu [2 ]
Gaur, Yashesh [1 ]
Wang, Chengyi [2 ]
Zhao, Rui [1 ]
Liu, Shujie [2 ]
机构
[1] Microsoft Speech & Language Grp, Redmond, WA 98052 USA
[2] Microsoft Res Asia, Beijing, Peoples R China
来源
关键词
end-to-end; RNN-transducer; attention-based encoder-decoder; transformer;
D O I
10.21437/Interspeech.2020-2846
中图分类号
R36 [病理学]; R76 [耳鼻咽喉科学];
学科分类号
100104 ; 100213 ;
摘要
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As E2E models are more data hungry, it is better to compare their effectiveness with large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in streaming mode if its encoder can be properly initialized. Among all three E2E models, transformer-AED achieved the best accuracy in both streaming and non-streaming mode. We show that both streaming RNN-T and transformer-AED models can obtain better accuracy than a highly-optimized hybrid model.
引用
收藏
页码:1 / 5
页数:5
相关论文
共 50 条
  • [1] A COMPARISON OF END-TO-END MODELS FOR LONG-FORM SPEECH RECOGNITION
    Chiu, Chung-Cheng
    Han, Wei
    Zhang, Yu
    Pang, Ruoming
    Kishchenko, Sergey
    Nguyen, Patrick
    Narayanan, Arun
    Liao, Hank
    Zhang, Shuyuan
    Kannan, Anjuli
    Prabhavalkar, Rohit
    Chen, Zhifeng
    Sainath, Tara
    Wu, Yonghui
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 889 - 896
  • [2] END-TO-END TRAINING OF A LARGE VOCABULARY END-TO-END SPEECH RECOGNITION SYSTEM
    Kim, Chanwoo
    Kim, Sungsoo
    Kim, Kwangyoun
    Kumar, Mehul
    Kim, Jiyeon
    Lee, Kyungmin
    Han, Changwoo
    Garg, Abhinav
    Kim, Eunhyang
    Shin, Minkyoo
    Singh, Shatrughan
    Heck, Larry
    Gowda, Dhananjaya
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 562 - 569
  • [3] Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model
    Kannan, Anjuli
    Datta, Arindrima
    Sainath, Tara N.
    Weinstein, Eugene
    Ramabhadran, Bhuvana
    Wu, Yonghui
    Bapna, Ankur
    Chen, Zhifeng
    Lee, Seungji
    INTERSPEECH 2019, 2019, : 2130 - 2134
  • [4] End-to-End Neural Segmental Models for Speech Recognition
    Tang, Hao
    Lu, Liang
    Kong, Lingpeng
    Gimpel, Kevin
    Livescu, Karen
    Dyer, Chris
    Smith, Noah A.
    Renals, Steve
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1254 - 1264
  • [5] Combination of end-to-end and hybrid models for speech recognition
    Wong, Jeremy H. M.
    Gaur, Yashesh
    Zhao, Rui
    Lu, Liang
    Sun, Eric
    Li, Jinyu
    Gong, Yifan
    INTERSPEECH 2020, 2020, : 1783 - 1787
  • [6] AN INVESTIGATION OF END-TO-END MODELS FOR ROBUST SPEECH RECOGNITION
    Prasad, Archiki
    Jyothi, Preethi
    Velmurugan, Rajbabu
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6893 - 6897
  • [7] End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow
    Variani, Ehsan
    Bagby, Tom
    McDermott, Erik
    Bacchiani, Michiel
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1641 - 1645
  • [8] End-to-End Large Vocabulary Speech Recognition for the Serbian Language
    Popovic, Branislav
    Pakoci, Edvin
    Pekar, Darko
    SPEECH AND COMPUTER, SPECOM 2017, 2017, 10458 : 343 - 352
  • [9] Improving End-to-End Models for Children's Speech Recognition
    Patel, Tanvina
    Scharenborg, Odette
    APPLIED SCIENCES-BASEL, 2024, 14 (06):
  • [10] Improved training of end-to-end attention models for speech recognition
    Zeyer, Albert
    Irie, Kazuki
    Schlueter, Ralf
    Ney, Hermann
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 7 - 11