SELF-ATTENTION NETWORKS FOR CONNECTIONIST TEMPORAL CLASSIFICATION IN SPEECH RECOGNITION

Cited: 0
Authors
Salazar, Julian [1]
Kirchhoff, Katrin [1,2]
Huang, Zhiheng [1]
Affiliations
[1] Amazon AI, Seattle, WA 98109 USA
[2] Univ Washington, Seattle, WA 98195 USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
speech recognition; connectionist temporal classification; self-attention; multi-head attention; end-to-end;
DOI
10.1109/icassp.2019.8682539
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-to-end speech recognition. SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU. Similar improvements hold for WERs after LM decoding. We motivate the architecture for speech, evaluate position and down-sampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance.
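The abstract describes CTC as an alignment-free, non-autoregressive training criterion: the network emits a per-frame distribution over labels plus a blank symbol, and the loss sums the probability of every frame-level path that collapses to the target sequence. The sketch below is an illustrative, dependency-free implementation of the standard CTC forward algorithm (Graves et al.); the function names and the toy 3-frame distribution are ours, not code from the paper.

```python
import itertools

def collapse(path, blank=0):
    """Collapse a frame-level path: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

def ctc_prob(probs, target, blank=0):
    """P(target | probs) via the CTC forward (alpha) recursion.

    probs: list of per-frame label distributions, shape (T, V);
    target: non-empty label sequence (no blanks)."""
    ext = [blank]
    for y in target:
        ext += [y, blank]              # interleave blanks: [_, y1, _, y2, _]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]     # start with a blank ...
    alpha[0][1] = probs[0][ext[1]]     # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay
            if s > 0:
                a += alpha[t - 1][s - 1]              # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]              # skip blank between distinct labels
            alpha[t][s] = a * probs[t][ext[s]]
    return alpha[T - 1][S - 1] + alpha[T - 1][S - 2]  # end on last label or final blank

# Sanity check: the dynamic program matches brute-force enumeration of all V^T paths.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]  # T=3 frames, V=3 (0 = blank)
target = [1, 2]
brute = sum(
    probs[0][p0] * probs[1][p1] * probs[2][p2]
    for p0, p1, p2 in itertools.product(range(3), repeat=3)
    if collapse([p0, p1, p2]) == target
)
assert abs(ctc_prob(probs, target) - brute) < 1e-12  # both equal 0.429 here
```

In practice the recursion is run in log space for numerical stability, and a self-attention encoder as in SAN-CTC simply supplies the per-frame distributions `probs`; the loss itself is unchanged.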
Pages: 7115-7119
Page count: 5
Related Papers
50 records in total
[31] Piano Transcription Using Temporal Harmonic Diagram and Transfer Window Attention in Self-Attention Networks [J].
Wu, Qiong; Yu, Tao.
Informatica (Slovenia), 2025, 49 (05): 133-145
[32] SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION [J].
Luo, Haoneng; Zhang, Shiliang; Lei, Ming; Xie, Lei.
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021: 75-81
[33] Acoustic model training using self-attention for low-resource speech recognition [J].
Park, Hosung; Kim, Ji-Hwan.
JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2020, 39 (05): 483-489
[34] Speech Emotion Recognition Based on Self-Attention Weight Correction for Acoustic and Text Features [J].
Santoso, Jennifer; Yamada, Takeshi; Ishizuka, Kenkichi; Hashimoto, Taiichi; Makino, Shoji.
IEEE ACCESS, 2022, 10: 115732-115743
[35] NASAL SPEECH SOUNDS DETECTION USING CONNECTIONIST TEMPORAL CLASSIFICATION [J].
Cernak, Milos; Tong, Sibo.
2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018: 5574-5578
[36] ADAMER-CTC: CONNECTIONIST TEMPORAL CLASSIFICATION WITH ADAPTIVE MAXIMUM ENTROPY REGULARIZATION FOR AUTOMATIC SPEECH RECOGNITION [J].
Eom, SooHwan; Yoon, Eunseop; Yoon, Hee Suk; Kim, Chanwoo; Hasegawa-Johnson, Mark; Yoo, Chang D.
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 2024: 12707-12711
[37] A Class Balanced Spatio-Temporal Self-Attention Model for Combat Intention Recognition [J].
Wang, Xuan; Jin, Benzhou; Jia, Mingyang; Wu, Gang; Zhang, Xiaofei.
IEEE ACCESS, 2024, 12: 112074-112084
[38] Cyclic Self-attention for Point Cloud Recognition [J].
Zhu, Guanyu; Zhou, Yong; Yao, Rui; Zhu, Hancheng; Zhao, Jiaqi.
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (01)
[39] Exploring Self-Attention for Visual Intersection Classification [J].
Nakata, Haruki; Tanaka, Kanji; Takeda, Koji.
JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2023, 27 (03): 386-393
[40] Efficient decoding self-attention for end-to-end speech synthesis [J].
Zhao, Wei; Xu, Li.
FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2022, 23 (07): 1127-1138