SELF-ATTENTION NETWORKS FOR CONNECTIONIST TEMPORAL CLASSIFICATION IN SPEECH RECOGNITION

Cited: 0
|
Authors
Salazar, Julian [1]
Kirchhoff, Katrin [1,2]
Huang, Zhiheng [1]
Affiliations
[1] Amazon AI, Seattle, WA 98109 USA
[2] Univ Washington, Seattle, WA 98195 USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
speech recognition; connectionist temporal classification; self-attention; multi-head attention; end-to-end;
DOI
10.1109/icassp.2019.8682539
CLC classification
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU. Similar improvements hold for WERs after LM decoding. We motivate the architecture for speech, evaluate position and down-sampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance.
Pages: 7115-7119
Page count: 5
Related papers
50 records in total
  • [1] ON THE USEFULNESS OF SELF-ATTENTION FOR AUTOMATIC SPEECH RECOGNITION WITH TRANSFORMERS
    Zhang, Shucong
    Loweimi, Erfan
    Bell, Peter
    Renals, Steve
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 89 - 96
  • [2] Attention-enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition
    Zhao, Ziping
    Bao, Zhongtian
    Zhang, Zixing
    Cummins, Nicholas
    Wang, Haishuai
    Schuller, Bjorn W.
    INTERSPEECH 2019, 2019, : 206 - 210
  • [3] Self-attention for Speech Emotion Recognition
    Tarantino, Lorenzo
    Garner, Philip N.
    Lazaridis, Alexandros
    INTERSPEECH 2019, 2019, : 2578 - 2582
  • [4] Speech emotion recognition using recurrent neural networks with directional self-attention
    Li, Dongdong
    Liu, Jinlin
    Yang, Zhuo
    Sun, Linyu
    Wang, Zhe
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 173
  • [5] Very Deep Self-Attention Networks for End-to-End Speech Recognition
    Ngoc-Quan Pham
    Thai-Son Nguyen
    Niehues, Jan
    Mueller, Markus
    Waibel, Alex
    INTERSPEECH 2019, 2019, : 66 - 70
  • [6] Applying Connectionist Temporal Classification Objective Function to Chinese Mandarin Speech Recognition
    Wang, Pengrui
    Li, Jie
    Xu, Bo
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [7] Multilingual Speech Recognition with Self-Attention Structured Parameterization
    Zhu, Yun
    Haghani, Parisa
    Tripathi, Anshuman
    Ramabhadran, Bhuvana
    Farris, Brian
    Xu, Hainan
    Lu, Han
    Sak, Hasim
    Leal, Isabel
    Gaur, Neeraj
    Moreno, Pedro J.
    Zhang, Qian
    INTERSPEECH 2020, 2020, : 4741 - 4745
  • [8] ESAformer: Enhanced Self-Attention for Automatic Speech Recognition
    Li, Junhua
    Duan, Zhikui
    Li, Shiren
    Yu, Xinmei
    Yang, Guangguang
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 471 - 475
  • [9] Multi-Stride Self-Attention for Speech Recognition
    Han, Kyu J.
    Huang, Jing
    Tang, Yun
    He, Xiaodong
    Zhou, Bowen
    INTERSPEECH 2019, 2019, : 2788 - 2792
  • [10] Self-Attention Transducers for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Wen, Zhengqi
    INTERSPEECH 2019, 2019, : 4395 - 4399