Key Frame Mechanism for Efficient Conformer Based End-to-End Speech Recognition

Cited by: 0
Authors
Fan, Peng [1 ]
Shan, Changhao [2 ]
Sun, Sining [2 ]
Yang, Qing [2 ]
Zhang, Jianwei [3 ]
Affiliations
[1] Sichuan Univ, Natl Key Lab Fundamental Sci Synthet Vis, Chengdu 610065, Peoples R China
[2] Du Xiaoman Financial, Beijing 100089, Peoples R China
[3] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
Keywords
Automatic speech recognition; self-attention; key frame; signal processing; drop frame
DOI
10.1109/LSP.2023.3327585
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronics & Communication Technology];
Discipline classification code
0808 ; 0809 ;
Abstract
Recently, the Conformer, as a backbone network for end-to-end automatic speech recognition, has achieved state-of-the-art performance. The Conformer block leverages a self-attention mechanism to capture global information, along with a convolutional neural network to capture local information, resulting in improved performance. However, the Conformer-based model faces an issue with the self-attention mechanism, whose computational complexity grows quadratically with the length of the input sequence. Inspired by previous Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance into the downsampling procedure of the Conformer encoder. We define a frame with a non-blank output as a key frame. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of the self-attention mechanism using key frames. The structure of our proposed approach comprises two encoders. Following the initial encoder, we introduce an intermediate CTC loss function to compute the label frame, enabling us to extract the key frames and blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism to operate directly on high-dimensional acoustic features and drop the frames corresponding to blank labels, which yields new acoustic feature sequences as input to the second encoder. The proposed method achieves comparable or better performance than the vanilla Conformer and other similar work such as the Efficient Conformer. Meanwhile, it can discard more than 60% of useless frames during model training and inference, which significantly accelerates inference.
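The frame-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes greedy per-frame decisions from the intermediate CTC output and uses illustrative function and variable names.

```python
import numpy as np

def select_key_frames(ctc_posteriors, features, blank_id=0):
    """Keep only frames whose intermediate-CTC argmax is a non-blank label.

    ctc_posteriors: (T, V) per-frame label probabilities from the first encoder
    features:       (T, D) acoustic features aligned with the posteriors
    Returns the downsampled (T', D) features and the kept frame indices.
    """
    labels = ctc_posteriors.argmax(axis=-1)        # greedy per-frame decision
    key_idx = np.flatnonzero(labels != blank_id)   # non-blank => key frame
    return features[key_idx], key_idx

# Toy example: 6 frames, vocabulary of 3 symbols (index 0 = blank).
post = np.array([
    [0.90, 0.05, 0.05],   # blank
    [0.10, 0.80, 0.10],   # label 1 -> key frame
    [0.85, 0.10, 0.05],   # blank
    [0.20, 0.10, 0.70],   # label 2 -> key frame
    [0.90, 0.05, 0.05],   # blank
    [0.95, 0.03, 0.02],   # blank
])
feats = np.arange(6 * 4, dtype=float).reshape(6, 4)
kept, idx = select_key_frames(post, feats)
# Here 4 of 6 frames are dropped; the shortened sequence `kept` would feed
# the second encoder (KFDS), and `idx` would restrict attention (KFSA).
```

In the paper's setting the kept sequence becomes the input to the second encoder, so the quadratic self-attention cost is paid only over the (much shorter) key-frame sequence.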
Pages: 1612-1616
Page count: 5
References
25 in total
  • [1] Battenberg E, 2017, 2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), P206, DOI 10.1109/ASRU.2017.8268937
  • [2] "Masks do not work": COVID-19 misperceptions and theory-driven corrective strategies on Facebook
    Borah, Porismita
    Kim, Sojung
    Hsu, Ying-Chia
    [J]. ONLINE INFORMATION REVIEW, 2023, 47 (05) : 880 - 905
  • [3] Bu H, 2017, 2017 20TH CONFERENCE OF THE ORIENTAL CHAPTER OF THE INTERNATIONAL COORDINATING COMMITTEE ON SPEECH DATABASES AND SPEECH I/O SYSTEMS AND ASSESSMENT (O-COCOSDA), P58, DOI 10.1109/ICSDA.2017.8384449
  • [4] EFFICIENT CONFORMER: PROGRESSIVE DOWNSAMPLING AND GROUPED ATTENTION FOR AUTOMATIC SPEECH RECOGNITION
    Burchi, Maxime
    Vielzeuf, Valentin
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 8 - 15
  • [5] Chan W, 2016, INT CONF ACOUST SPEE, P4960, DOI 10.1109/ICASSP.2016.7472621
  • [6] Phone Synchronous Decoding with CTC Lattice
    Chen, Zhehuai
    Deng, Wei
    Xu, Tao
    Yu, Kai
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 1923 - 1927
  • [7] Dong LH, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P5884, DOI 10.1109/ICASSP.2018.8462506
  • [8] Speech Recognition for Air Traffic Control via Feature Learning and End-to-End Training
    Fan, Peng
    Hua, Xiyao
    Lin, Yi
    Yang, Bo
    Zhang, Jianwei
    Ge, Wenyi
    Guo, Dongyue
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2023, E106D (04) : 538 - 544
  • [9] Graves A., 2006, P ICML, P369
  • [10] Graves A, 2012, Arxiv, DOI arXiv:1211.3711