SELF-ATTENTION NETWORKS FOR CONNECTIONIST TEMPORAL CLASSIFICATION IN SPEECH RECOGNITION

Cited: 0
Authors
Salazar, Julian [1]
Kirchhoff, Katrin [1,2]
Huang, Zhiheng [1]
Affiliations
[1] Amazon AI, Seattle, WA 98109 USA
[2] Univ Washington, Seattle, WA 98195 USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
speech recognition; connectionist temporal classification; self-attention; multi-head attention; end-to-end;
DOI
10.1109/icassp.2019.8682539
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU. Similar improvements hold for WERs after LM decoding. We motivate the architecture for speech, evaluate position and down-sampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance.
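The abstract describes CTC as an alignment-free, non-autoregressive training criterion: the network emits a per-frame distribution over labels plus a blank symbol, and the loss sums the probability of every frame-level path that collapses to the target sequence. The sketch below is an illustrative, dependency-free implementation of the standard CTC forward algorithm (Graves et al.); the function names and the toy 3-frame distribution are ours, not code from the paper.

```python
import itertools

def collapse(path, blank=0):
    """Collapse a frame-level path: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

def ctc_prob(probs, target, blank=0):
    """P(target | probs) via the CTC forward (alpha) recursion.

    probs: list of per-frame label distributions, shape (T, V);
    target: non-empty label sequence (no blanks)."""
    ext = [blank]
    for y in target:
        ext += [y, blank]              # interleave blanks: [_, y1, _, y2, _]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]     # start with a blank ...
    alpha[0][1] = probs[0][ext[1]]     # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay
            if s > 0:
                a += alpha[t - 1][s - 1]              # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]              # skip blank between distinct labels
            alpha[t][s] = a * probs[t][ext[s]]
    return alpha[T - 1][S - 1] + alpha[T - 1][S - 2]  # end on last label or final blank

# Sanity check: the dynamic program matches brute-force enumeration of all V^T paths.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]  # T=3 frames, V=3 (0 = blank)
target = [1, 2]
brute = sum(
    probs[0][p0] * probs[1][p1] * probs[2][p2]
    for p0, p1, p2 in itertools.product(range(3), repeat=3)
    if collapse([p0, p1, p2]) == target
)
assert abs(ctc_prob(probs, target) - brute) < 1e-12  # both equal 0.429 here
```

In practice the recursion is run in log space for numerical stability, and a self-attention encoder as in SAN-CTC simply supplies the per-frame distributions `probs`; the loss itself is unchanged.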
Pages: 7115-7119
Page count: 5
Related Papers
50 records in total
[31] Piano Transcription Using Temporal Harmonic Diagram and Transfer Window Attention in Self-Attention Networks [J].
Wu, Qiong; Yu, Tao.
Informatica (Slovenia), 2025, 49 (05): 133-145
[32] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION [J].
Luo, Haoneng; Zhang, Shiliang; Lei, Ming; Xie, Lei.
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021: 75-81
[33] Acoustic model training using self-attention for low-resource speech recognition [J].
Park, Hosung; Kim, Ji-Hwan.
JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2020, 39 (05): 483-489
[34] Speech Emotion Recognition Based on Self-Attention Weight Correction for Acoustic and Text Features [J].
Santoso, Jennifer; Yamada, Takeshi; Ishizuka, Kenkichi; Hashimoto, Taiichi; Makino, Shoji.
IEEE ACCESS, 2022, 10: 115732-115743
[35] NASAL SPEECH SOUNDS DETECTION USING CONNECTIONIST TEMPORAL CLASSIFICATION [J].
Cernak, Milos; Tong, Sibo.
2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018: 5574-5578
[36] ADAMER-CTC: CONNECTIONIST TEMPORAL CLASSIFICATION WITH ADAPTIVE MAXIMUM ENTROPY REGULARIZATION FOR AUTOMATIC SPEECH RECOGNITION [J].
Eom, SooHwan; Yoon, Eunseop; Yoon, Hee Suk; Kim, Chanwoo; Hasegawa-Johnson, Mark; Yoo, Chang D.
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 2024: 12707-12711
[37] A Class Balanced Spatio-Temporal Self-Attention Model for Combat Intention Recognition [J].
Wang, Xuan; Jin, Benzhou; Jia, Mingyang; Wu, Gang; Zhang, Xiaofei.
IEEE ACCESS, 2024, 12: 112074-112084
[38] Cyclic Self-attention for Point Cloud Recognition [J].
Zhu, Guanyu; Zhou, Yong; Yao, Rui; Zhu, Hancheng; Zhao, Jiaqi.
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (01)
[39] Exploring Self-Attention for Visual Intersection Classification [J].
Nakata, Haruki; Tanaka, Kanji; Takeda, Koji.
JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2023, 27 (03): 386-393
[40] Efficient decoding self-attention for end-to-end speech synthesis [J].
Zhao, Wei; Xu, Li.
FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2022, 23 (07): 1127-1138