Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Cited by: 135
|
Authors
Karita, Shigeki [1 ]
Soplin, Nelson Enrique Yalta [2 ]
Watanabe, Shinji [3 ]
Delcroix, Marc [1 ]
Ogawa, Atsunori [1 ]
Nakatani, Tomohiro [1 ]
Affiliations
[1] NTT Commun Sci Labs, Kyoto, Japan
[2] Waseda Univ, Tokyo, Japan
[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Source
INTERSPEECH 2019 | 2019
Keywords
speech recognition; Transformer; connectionist temporal classification; language model;
DOI
10.21437/Interspeech.2019-1938
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104; 100213;
Abstract
The state-of-the-art Transformer architecture has been used successfully for many sequence-to-sequence transformation tasks. Its advantage is fast training iterations, because it has none of the sequential operations of recurrent neural networks (RNNs). However, RNNs remain the best option for end-to-end automatic speech recognition (ASR) in terms of overall training speed (i.e., convergence) and word error rate (WER), thanks to effective joint training and decoding methods. To realize a faster and more accurate ASR system, we combine the Transformer with these advances in RNN-based ASR. In our experiments, we found that Transformer training converges more slowly than RNN training and that naive language model (LM) integration is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with the Transformer for joint training and decoding. This approach makes training faster than with RNNs and assists LM integration. Our proposed ASR system achieves significant improvements on various ASR tasks. For example, introducing CTC and LM integration into the Transformer baseline reduced the WER from 11.1% to 4.5% on the Wall Street Journal corpus and from 16.1% to 11.6% on TED-LIUM.
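The abstract describes two mechanisms: multi-task training that interpolates the CTC and attention (decoder) losses, and joint decoding that combines attention, CTC, and external LM scores per hypothesis. The sketch below illustrates only these score combinations; the function names and the weight values (`lam`, `lm_weight`) are hypothetical placeholders, not values reported in the paper.

```python
# Illustrative sketch of joint CTC/attention training and decoding scores.
# The interpolation weights below are assumptions for demonstration only.

def joint_training_loss(ctc_loss: float, attention_loss: float,
                        lam: float = 0.3) -> float:
    """Multi-task objective: interpolate CTC and attention-decoder losses."""
    return lam * ctc_loss + (1.0 - lam) * attention_loss

def joint_decoding_score(att_logprob: float, ctc_logprob: float,
                         lm_logprob: float, lam: float = 0.3,
                         lm_weight: float = 0.5) -> float:
    """Beam-search hypothesis score: attention and CTC log-probabilities
    interpolated, plus a weighted external LM log-probability."""
    return ((1.0 - lam) * att_logprob
            + lam * ctc_logprob
            + lm_weight * lm_logprob)
```

In practice the CTC branch shares the encoder with the attention decoder, so the extra training cost is small, while the CTC alignment constraint speeds convergence; the additive LM term in the decoding score is what makes shallow LM fusion straightforward.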
Pages: 1408-1412
Page count: 5
Related Papers
50 records in total
  • [21] Speech-and-Text Transformer: Exploiting Unpaired Text for End-to-End Speech Recognition
    Wang, Qinyi
    Zhou, Xinyuan
    Li, Haizhou
    APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2023, 12 (01)
  • [22] End-to-end automated speech recognition using a character based small scale transformer architecture
    Loubser, Alexander
    De Villiers, Pieter
    De Freitas, Allan
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 252
  • [23] HyperSFormer: A Transformer-Based End-to-End Hyperspectral Image Classification Method for Crop Classification
    Xie, Jiaxing
    Hua, Jiajun
    Chen, Shaonan
    Wu, Peiwen
    Gao, Peng
    Sun, Daozong
    Lyu, Zhendong
    Lyu, Shilei
    Xue, Xiuyun
    Lu, Jianqiang
    REMOTE SENSING, 2023, 15 (14)
  • [24] Investigation of Transformer based Spelling Correction Model for CTC-based End-to-End Mandarin Speech Recognition
    Zhang, Shiliang
    Lei, Ming
    Yan, Zhijie
    INTERSPEECH 2019, 2019: 2180-2184
  • [25] An End-to-End Chinese Speech Recognition Algorithm Integrating Language Model
    Lü, Kun-Ru
    Wu, Chun-Guo
    Liang, Yan-Chun
    Yuan, Yu-Ping
    Ren, Zhi-Min
    Zhou, You
    Shi, Xiao-Hu
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2021, 49(11): 2177-2185
  • [26] INTERNAL LANGUAGE MODEL ESTIMATION FOR DOMAIN-ADAPTIVE END-TO-END SPEECH RECOGNITION
    Meng, Zhong
    Parthasarathy, Sarangarajan
    Sun, Eric
    Gaur, Yashesh
    Kanda, Naoyuki
    Lu, Liang
    Chen, Xie
    Zhao, Rui
    Li, Jinyu
    Gong, Yifan
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021: 243-250
  • [27] INTERNAL LANGUAGE MODEL TRAINING FOR DOMAIN-ADAPTIVE END-TO-END SPEECH RECOGNITION
    Meng, Zhong
    Kanda, Naoyuki
    Gaur, Yashesh
    Parthasarathy, Sarangarajan
    Sun, Eric
    Lu, Liang
    Chen, Xie
    Li, Jinyu
    Gong, Yifan
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 7338-7342
  • [28] Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies
    Li, Zehan
    Miao, Haoran
    Deng, Keqi
    Cheng, Gaofeng
    Tian, Sanli
    Li, Ta
    Yan, Yonghong
    INTERSPEECH 2022, 2022: 1671-1675
  • [29] A SPELLING CORRECTION MODEL FOR END-TO-END SPEECH RECOGNITION
    Guo, Jinxi
    Sainath, Tara N.
    Weiss, Ron J.
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 5651-5655
  • [30] Hardware Accelerator for Transformer based End-to-End Automatic Speech Recognition System
    Yamini, Shaarada D.
    Mirishkar, Ganesh S.
    Vuppala, Anil Kumar
    Purini, Suresh
    2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW, 2023: 93-100