Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Cited by: 135
|
Authors
Karita, Shigeki [1 ]
Soplin, Nelson Enrique Yalta [2 ]
Watanabe, Shinji [3 ]
Delcroix, Marc [1 ]
Ogawa, Atsunori [1 ]
Nakatani, Tomohiro [1 ]
Affiliations
[1] NTT Commun Sci Labs, Kyoto, Japan
[2] Waseda Univ, Tokyo, Japan
[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Source
INTERSPEECH 2019 | 2019
Keywords
speech recognition; Transformer; connectionist temporal classification; language model;
DOI
10.21437/Interspeech.2019-1938
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104; 100213;
Abstract
The state-of-the-art Transformer architecture has been used successfully for many sequence-to-sequence transformation tasks. Its advantage is fast training iterations, because it has none of the sequential operations of recurrent neural networks (RNNs). However, RNNs remain the best option for end-to-end automatic speech recognition (ASR) in terms of overall training speed (i.e., convergence) and word error rate (WER), thanks to effective joint training and decoding methods. To realize a faster and more accurate ASR system, we combine the Transformer with these advances in RNN-based ASR. In our experiments, we found that Transformer training converges more slowly than RNN training and that naive language model (LM) integration is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with the Transformer for joint training and decoding. This approach makes training faster than with RNNs and assists LM integration. Our proposed ASR system achieves significant improvements on various ASR tasks. For example, introducing CTC and LM integration into the Transformer baseline reduced the WER from 11.1% to 4.5% on the Wall Street Journal corpus and from 16.1% to 11.6% on TED-LIUM.
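The abstract describes two mechanisms: multi-task training that interpolates the CTC and attention (decoder) losses, and joint decoding that combines attention, CTC, and external LM scores per hypothesis. The sketch below illustrates only these score combinations; the function names and the weight values (`lam`, `lm_weight`) are hypothetical placeholders, not values reported in the paper.

```python
# Illustrative sketch of joint CTC/attention training and decoding scores.
# The interpolation weights below are assumptions for demonstration only.

def joint_training_loss(ctc_loss: float, attention_loss: float,
                        lam: float = 0.3) -> float:
    """Multi-task objective: interpolate CTC and attention-decoder losses."""
    return lam * ctc_loss + (1.0 - lam) * attention_loss

def joint_decoding_score(att_logprob: float, ctc_logprob: float,
                         lm_logprob: float, lam: float = 0.3,
                         lm_weight: float = 0.5) -> float:
    """Beam-search hypothesis score: attention and CTC log-probabilities
    interpolated, plus a weighted external LM log-probability."""
    return ((1.0 - lam) * att_logprob
            + lam * ctc_logprob
            + lm_weight * lm_logprob)
```

In practice the CTC branch shares the encoder with the attention decoder, so the extra training cost is small, while the CTC alignment constraint speeds convergence; the additive LM term in the decoding score is what makes shallow LM fusion straightforward.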
Pages: 1408-1412
Page count: 5
Related Papers
50 records in total
  • [21] Speech-and-Text Transformer: Exploiting Unpaired Text for End-to-End Speech Recognition
    Wang, Qinyi
    Zhou, Xinyuan
    Li, Haizhou
    APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2023, 12 (01)
  • [22] End-to-end automated speech recognition using a character based small scale transformer architecture
    Loubser, Alexander
    De Villiers, Pieter
    De Freitas, Allan
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 252
  • [23] HyperSFormer: A Transformer-Based End-to-End Hyperspectral Image Classification Method for Crop Classification
    Xie, Jiaxing
    Hua, Jiajun
    Chen, Shaonan
    Wu, Peiwen
    Gao, Peng
    Sun, Daozong
    Lyu, Zhendong
    Lyu, Shilei
    Xue, Xiuyun
    Lu, Jianqiang
    REMOTE SENSING, 2023, 15 (14)
  • [24] Investigation of Transformer based Spelling Correction Model for CTC-based End-to-End Mandarin Speech Recognition
    Zhang, Shiliang
    Lei, Ming
    Yan, Zhijie
    INTERSPEECH 2019, 2019: 2180-2184
  • [25] An End-to-End Chinese Speech Recognition Algorithm Integrating Language Model
    Lü, Kun-Ru
    Wu, Chun-Guo
    Liang, Yan-Chun
    Yuan, Yu-Ping
    Ren, Zhi-Min
    Zhou, You
    Shi, Xiao-Hu
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2021, 49(11): 2177-2185
  • [26] INTERNAL LANGUAGE MODEL ESTIMATION FOR DOMAIN-ADAPTIVE END-TO-END SPEECH RECOGNITION
    Meng, Zhong
    Parthasarathy, Sarangarajan
    Sun, Eric
    Gaur, Yashesh
    Kanda, Naoyuki
    Lu, Liang
    Chen, Xie
    Zhao, Rui
    Li, Jinyu
    Gong, Yifan
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021: 243-250
  • [27] INTERNAL LANGUAGE MODEL TRAINING FOR DOMAIN-ADAPTIVE END-TO-END SPEECH RECOGNITION
    Meng, Zhong
    Kanda, Naoyuki
    Gaur, Yashesh
    Parthasarathy, Sarangarajan
    Sun, Eric
    Lu, Liang
    Chen, Xie
    Li, Jinyu
    Gong, Yifan
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 7338-7342
  • [28] Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies
    Li, Zehan
    Miao, Haoran
    Deng, Keqi
    Cheng, Gaofeng
    Tian, Sanli
    Li, Ta
    Yan, Yonghong
    INTERSPEECH 2022, 2022: 1671-1675
  • [29] A SPELLING CORRECTION MODEL FOR END-TO-END SPEECH RECOGNITION
    Guo, Jinxi
    Sainath, Tara N.
    Weiss, Ron J.
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 5651-5655
  • [30] Hardware Accelerator for Transformer based End-to-End Automatic Speech Recognition System
    Yamini, Shaarada D.
    Mirishkar, Ganesh S.
    Vuppala, Anil Kumar
    Purini, Suresh
    2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW, 2023: 93-100