SELF-ATTENTION NETWORKS FOR CONNECTIONIST TEMPORAL CLASSIFICATION IN SPEECH RECOGNITION

Cited by: 0
Authors
Salazar, Julian [1]
Kirchhoff, Katrin [1, 2]
Huang, Zhiheng [1]
Affiliations
[1] Amazon AI, Seattle, WA 98109 USA
[2] Univ Washington, Seattle, WA 98195 USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
speech recognition; connectionist temporal classification; self-attention; multi-head attention; end-to-end;
DOI
10.1109/icassp.2019.8682539
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU. Similar improvements hold for WERs after LM decoding. We motivate the architecture for speech, evaluate position and down-sampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance.
Pages: 7115-7119
Page count: 5
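
The abstract reports its results as character error rates (CERs). As a minimal sketch, assuming the conventional definition (character-level Levenshtein distance between reference and hypothesis, divided by the reference length; this record does not state the formula), CER could be computed as below. The function names `edit_distance` and `cer` are illustrative and do not come from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (unit-cost edits)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters so far
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop a reference character
            insertion = curr[j - 1] + 1  # extra hypothesis character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length.

    Assumed definition (hypothetical helper, not from the paper).
    """
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition a CER of 4.7% corresponds to roughly 47 character edits per 1000 reference characters. Word error rate (WER) follows the same pattern applied to lists of words instead of characters.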