Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Cited: 135
Authors
Karita, Shigeki [1 ]
Soplin, Nelson Enrique Yalta [2 ]
Watanabe, Shinji [3 ]
Delcroix, Marc [1 ]
Ogawa, Atsunori [1 ]
Nakatani, Tomohiro [1 ]
Affiliations
[1] NTT Commun Sci Labs, Kyoto, Japan
[2] Waseda Univ, Tokyo, Japan
[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Source
INTERSPEECH 2019, 2019
Keywords
speech recognition; Transformer; connectionist temporal classification; language model;
DOI
10.21437/Interspeech.2019-1938
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Codes
100104; 100213;
Abstract
The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. The advantage of this architecture is that it has a fast iteration speed in the training stage because there is no sequential operation as with recurrent neural networks (RNN). However, an RNN is still the best option for end-to-end automatic speech recognition (ASR) tasks in terms of overall training speed (i.e., convergence) and word error rate (WER) because of effective joint training and decoding methods. To realize a faster and more accurate ASR system, we combine Transformer and the advances in RNN-based ASR. In our experiments, we found that the training of Transformer is slower than that of RNN as regards the learning curve and integration with the naive language model (LM) is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with Transformer for joint training and decoding. This approach makes training faster than with RNNs and assists LM integration. Our proposed ASR system realizes significant improvements in various ASR tasks. For example, it reduced the WERs from 11.1% to 4.5% on the Wall Street Journal and from 16.1% to 11.6% on the TED-LIUM by introducing CTC and LM integration into the Transformer baseline.
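The abstract describes two score combinations: a joint training objective that interpolates the CTC and attention (decoder) losses, and a one-pass decoding score that further adds a language-model term. A minimal sketch of both interpolations, assuming illustrative hyperparameter names (`ctc_weight`, `lm_weight`) and placeholder log-probabilities rather than values or APIs from the paper:

```python
# Hedged sketch of joint CTC/attention training and LM-integrated decoding
# as described in the abstract. The weights and function names here are
# illustrative assumptions, not taken from the paper's implementation.

def multitask_loss(loss_ctc: float, loss_att: float,
                   ctc_weight: float = 0.3) -> float:
    """Joint training objective: linear interpolation of the CTC loss
    and the attention-decoder (cross-entropy) loss."""
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att

def joint_decode_score(logp_ctc: float, logp_att: float, logp_lm: float,
                       ctc_weight: float = 0.3,
                       lm_weight: float = 0.5) -> float:
    """One-pass joint decoding score for a beam-search hypothesis:
    interpolate CTC and attention log-probabilities, then add a
    weighted language-model log-probability (shallow fusion)."""
    asr_score = ctc_weight * logp_ctc + (1.0 - ctc_weight) * logp_att
    return asr_score + lm_weight * logp_lm
```

In a beam search, `joint_decode_score` would be evaluated per hypothesis prefix; the CTC term helps prune hypotheses with implausible alignments, which is what makes the LM integration easier than with an attention-only decoder.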
Pages: 1408-1412
Page count: 5