SELF-ATTENTION NETWORKS FOR CONNECTIONIST TEMPORAL CLASSIFICATION IN SPEECH RECOGNITION

Cited by: 0
Authors
Salazar, Julian [1 ]
Kirchhoff, Katrin [1 ,2 ]
Huang, Zhiheng [1 ]
Affiliations
[1] Amazon AI, Seattle, WA 98109 USA
[2] Univ Washington, Seattle, WA 98195 USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
speech recognition; connectionist temporal classification; self-attention; multi-head attention; end-to-end;
DOI
10.1109/icassp.2019.8682539
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU. Similar improvements hold for WERs after LM decoding. We motivate the architecture for speech, evaluate position and down-sampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance.
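The abstract's description of CTC as "alignment-free" refers to marginalizing over all frame-level alignments that collapse (via blank removal and repeat merging) to the target label sequence. As an illustration only — this is a minimal sketch of the standard CTC forward (alpha) recursion in plain Python, not code from the paper; the toy probabilities and the `ctc_forward` name are assumptions for the example:

```python
import math

def ctc_forward(log_probs, target, blank=0):
    """Log-probability of `target` under CTC: sums over every frame-level
    path that collapses to `target`. `log_probs[t][k]` is the per-frame
    log-probability of symbol k; `blank` is the CTC blank index."""
    # Interleave blanks: target [a] -> extended sequence [blank, a, blank]
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    # Initialize: a path may start on the leading blank or the first label
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]          # stay on the same extended symbol
            if s >= 1:
                terms.append(alpha[s - 1])  # advance by one
            # Skip over a blank, allowed only between distinct labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            m = max(terms)
            if m > NEG_INF:  # log-sum-exp of the incoming paths
                new[s] = m + math.log(sum(math.exp(x - m) for x in terms)) \
                           + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last label or the trailing blank
    m = max(alpha[-1], alpha[-2])
    return m + math.log(math.exp(alpha[-1] - m) + math.exp(alpha[-2] - m))

# Toy example: 2 frames, vocab {0: blank, 1: 'a'}, target "a" (= [1]).
log_probs = [[math.log(0.4), math.log(0.6)],
             [math.log(0.3), math.log(0.7)]]
p = math.exp(ctc_forward(log_probs, [1]))
print(round(p, 4))  # 0.88 = P(aa) + P(a,blank) + P(blank,a)
```

The three alignments `aa`, `a‑blank`, and `blank‑a` all collapse to "a", so their probabilities (0.42 + 0.18 + 0.28) sum to 0.88, matching the recursion — this marginalization is what lets CTC train without frame-level alignments.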
Pages: 7115-7119
Page count: 5
Related papers
50 items in total
[21]   GAUSSIAN KERNELIZED SELF-ATTENTION FOR LONG SEQUENCE DATA AND ITS APPLICATION TO CTC-BASED SPEECH RECOGNITION [J].
Kashiwagi, Yosuke ;
Tsunoo, Emiru ;
Watanabe, Shinji .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6214-6218
[22]   Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition [J].
Gong, Rong ;
Quillen, Carl ;
Sharma, Dushyant ;
Goderre, Andrew ;
Lainez, Jose ;
Milanovic, Ljubomir .
INTERSPEECH 2021, 2021, :3840-3844
[23]   Self-Attention Enhanced Recurrent Neural Networks for Sentence Classification [J].
Kumar, Ankit ;
Rastogi, Reshma .
2018 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (IEEE SSCI), 2018, :905-911
[24]   SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition [J].
Gao, Zhifu ;
Zhang, Shiliang ;
Lei, Ming ;
McLoughlin, Ian .
INTERSPEECH 2020, 2020, :6-10
[25]   Self-Attention Networks For Motion Posture Recognition Based On Data Fusion [J].
Ji, Zhihao ;
Xie, Qiang .
4TH INTERNATIONAL CONFERENCE ON INFORMATICS ENGINEERING AND INFORMATION SCIENCE (ICIEIS2021), 2022, 12161
[26]   Self-Attention Networks for Human Activity Recognition Using Wearable Devices [J].
Betancourt, Carlos ;
Chen, Wen-Hui ;
Kuan, Chi-Wei .
2020 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2020, :1194-1199
[27]   Deformable Self-Attention for Text Classification [J].
Ma, Qianli ;
Yan, Jiangyue ;
Lin, Zhenxi ;
Yu, Liuhong ;
Chen, Zipeng .
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 :1570-1581
[28]   Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration [J].
Karita, Shigeki ;
Soplin, Nelson Enrique Yalta ;
Watanabe, Shinji ;
Delcroix, Marc ;
Ogawa, Atsunori ;
Nakatani, Tomohiro .
INTERSPEECH 2019, 2019, :1408-1412
[29]   SPEECH DENOISING IN THE WAVEFORM DOMAIN WITH SELF-ATTENTION [J].
Kong, Zhifeng ;
Ping, Wei ;
Dantrey, Ambrish ;
Catanzaro, Bryan .
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, :7867-7871
[30]   Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition [J].
Wu, Long ;
Li, Ta ;
Wang, Li ;
Yan, Yonghong .
APPLIED SCIENCES-BASEL, 2019, 9 (21)