Key Frame Mechanism for Efficient Conformer Based End-to-End Speech Recognition

Cited by: 0
Authors
Fan, Peng [1 ]
Shan, Changhao [2 ]
Sun, Sining [2 ]
Yang, Qing [2 ]
Zhang, Jianwei [3 ]
Affiliations
[1] Sichuan Univ, Natl Key Lab Fundamental Sci Synthet Vis, Chengdu 610065, Peoples R China
[2] Du Xiaoman Financial, Beijing 100089, Peoples R China
[3] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
Keywords
Automatic speech recognition; self-attention; key frame; signal processing; drop frame
DOI
10.1109/LSP.2023.3327585
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronics & Communication Technology];
Discipline classification code
0808 ; 0809 ;
Abstract
Recently, the Conformer, as a backbone network for end-to-end automatic speech recognition, has achieved state-of-the-art performance. The Conformer block leverages a self-attention mechanism to capture global information, along with a convolutional neural network to capture local information, resulting in improved performance. However, the Conformer-based model faces an issue with the self-attention mechanism, whose computational complexity grows quadratically with the length of the input sequence. Inspired by previous Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance into the downsampling procedure of the Conformer encoder. We define a frame with a non-blank output as a key frame. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of the self-attention mechanism using key frames. The structure of our proposed approach comprises two encoders. Following the initial encoder, we introduce an intermediate CTC loss function to compute the label frame, enabling us to extract the key frames and blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism to operate directly on high-dimensional acoustic features and drop the frames corresponding to blank labels, which yields new acoustic feature sequences as input to the second encoder. The proposed method achieves comparable or better performance than the vanilla Conformer and other similar work such as the Efficient Conformer. Meanwhile, it can discard more than 60% of useless frames during model training and inference, which significantly accelerates inference.
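The frame-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes greedy per-frame decisions from the intermediate CTC output and uses illustrative function and variable names.

```python
import numpy as np

def select_key_frames(ctc_posteriors, features, blank_id=0):
    """Keep only frames whose intermediate-CTC argmax is a non-blank label.

    ctc_posteriors: (T, V) per-frame label probabilities from the first encoder
    features:       (T, D) acoustic features aligned with the posteriors
    Returns the downsampled (T', D) features and the kept frame indices.
    """
    labels = ctc_posteriors.argmax(axis=-1)        # greedy per-frame decision
    key_idx = np.flatnonzero(labels != blank_id)   # non-blank => key frame
    return features[key_idx], key_idx

# Toy example: 6 frames, vocabulary of 3 symbols (index 0 = blank).
post = np.array([
    [0.90, 0.05, 0.05],   # blank
    [0.10, 0.80, 0.10],   # label 1 -> key frame
    [0.85, 0.10, 0.05],   # blank
    [0.20, 0.10, 0.70],   # label 2 -> key frame
    [0.90, 0.05, 0.05],   # blank
    [0.95, 0.03, 0.02],   # blank
])
feats = np.arange(6 * 4, dtype=float).reshape(6, 4)
kept, idx = select_key_frames(post, feats)
# Here 4 of 6 frames are dropped; the shortened sequence `kept` would feed
# the second encoder (KFDS), and `idx` would restrict attention (KFSA).
```

In the paper's setting the kept sequence becomes the input to the second encoder, so the quadratic self-attention cost is paid only over the (much shorter) key-frame sequence.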
Pages: 1612-1616
Page count: 5
References
25 in total
  • [1] Battenberg E, 2017, 2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), P206, DOI 10.1109/ASRU.2017.8268937
  • [2] "Masks do not work": COVID-19 misperceptions and theory-driven corrective strategies on Facebook
    Borah, Porismita
    Kim, Sojung
    Hsu, Ying-Chia
    [J]. ONLINE INFORMATION REVIEW, 2023, 47 (05) : 880 - 905
  • [3] Bu H, 2017, 2017 20TH CONFERENCE OF THE ORIENTAL CHAPTER OF THE INTERNATIONAL COORDINATING COMMITTEE ON SPEECH DATABASES AND SPEECH I/O SYSTEMS AND ASSESSMENT (O-COCOSDA), P58, DOI 10.1109/ICSDA.2017.8384449
  • [4] EFFICIENT CONFORMER: PROGRESSIVE DOWNSAMPLING AND GROUPED ATTENTION FOR AUTOMATIC SPEECH RECOGNITION
    Burchi, Maxime
    Vielzeuf, Valentin
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 8 - 15
  • [5] Chan W, 2016, INT CONF ACOUST SPEE, P4960, DOI 10.1109/ICASSP.2016.7472621
  • [6] Phone Synchronous Decoding with CTC Lattice
    Chen, Zhehuai
    Deng, Wei
    Xu, Tao
    Yu, Kai
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 1923 - 1927
  • [7] Dong LH, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P5884, DOI 10.1109/ICASSP.2018.8462506
  • [8] Speech Recognition for Air Traffic Control via Feature Learning and End-to-End Training
    Fan, Peng
    Hua, Xiyao
    Lin, Yi
    Yang, Bo
    Zhang, Jianwei
    Ge, Wenyi
    Guo, Dongyue
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2023, E106D (04) : 538 - 544
  • [9] Graves A., 2006, P ICML, P369
  • [10] Graves A, 2012, Arxiv, DOI arXiv:1211.3711