SELF-ATTENTION NETWORKS FOR CONNECTIONIST TEMPORAL CLASSIFICATION IN SPEECH RECOGNITION

Cited: 0
|
Authors
Salazar, Julian [1]
Kirchhoff, Katrin [1,2]
Huang, Zhiheng [1]
Affiliations
[1] Amazon AI, Seattle, WA 98109 USA
[2] Univ Washington, Seattle, WA 98195 USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
speech recognition; connectionist temporal classification; self-attention; multi-head attention; end-to-end;
DOI
10.1109/icassp.2019.8682539
CLC classification
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU. Similar improvements hold for WERs after LM decoding. We motivate the architecture for speech, evaluate position and down-sampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance.
Pages: 7115-7119
Page count: 5
Related papers
50 records in total
  • [1] ON THE USEFULNESS OF SELF-ATTENTION FOR AUTOMATIC SPEECH RECOGNITION WITH TRANSFORMERS
    Zhang, Shucong
    Loweimi, Erfan
    Bell, Peter
    Renals, Steve
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 89 - 96
  • [2] Attention-enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition
    Zhao, Ziping
    Bao, Zhongtian
    Zhang, Zixing
    Cummins, Nicholas
    Wang, Haishuai
    Schuller, Bjorn W.
    INTERSPEECH 2019, 2019, : 206 - 210
  • [3] Self-attention for Speech Emotion Recognition
    Tarantino, Lorenzo
    Garner, Philip N.
    Lazaridis, Alexandros
    INTERSPEECH 2019, 2019, : 2578 - 2582
  • [4] Speech emotion recognition using recurrent neural networks with directional self-attention
    Li, Dongdong
    Liu, Jinlin
    Yang, Zhuo
    Sun, Linyu
    Wang, Zhe
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 173
  • [5] Very Deep Self-Attention Networks for End-to-End Speech Recognition
    Ngoc-Quan Pham
    Thai-Son Nguyen
    Niehues, Jan
    Mueller, Markus
    Waibel, Alex
    INTERSPEECH 2019, 2019, : 66 - 70
  • [6] Applying Connectionist Temporal Classification Objective Function to Chinese Mandarin Speech Recognition
    Wang, Pengrui
    Li, Jie
    Xu, Bo
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [7] Multilingual Speech Recognition with Self-Attention Structured Parameterization
    Zhu, Yun
    Haghani, Parisa
    Tripathi, Anshuman
    Ramabhadran, Bhuvana
    Farris, Brian
    Xu, Hainan
    Lu, Han
    Sak, Hasim
    Leal, Isabel
    Gaur, Neeraj
    Moreno, Pedro J.
    Zhang, Qian
    INTERSPEECH 2020, 2020, : 4741 - 4745
  • [8] ESAformer: Enhanced Self-Attention for Automatic Speech Recognition
    Li, Junhua
    Duan, Zhikui
    Li, Shiren
    Yu, Xinmei
    Yang, Guangguang
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 471 - 475
  • [9] Multi-Stride Self-Attention for Speech Recognition
    Han, Kyu J.
    Huang, Jing
    Tang, Yun
    He, Xiaodong
    Zhou, Bowen
    INTERSPEECH 2019, 2019, : 2788 - 2792
  • [10] Self-Attention Transducers for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Wen, Zhengqi
    INTERSPEECH 2019, 2019, : 4395 - 4399