Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation

Cited by: 3
Authors
Haidar, Md Akmal [1 ]
Xing, Chao [1 ]
Rezagholizadeh, Mehdi [1 ]
Affiliations
[1] Huawei Noah's Ark Lab, Montreal Res Ctr, Montreal, PQ, Canada
Source
INTERSPEECH 2021 | 2021
Keywords
speech recognition; transformer; sub-sampling; self-knowledge distillation;
DOI
10.21437/Interspeech.2021-1743
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
Reducing the length of the input speech-feature sequence through sub-sampling eases the alignment between speech features and the text transcript and is an important way to obtain better results in end-to-end (E2E) automatic speech recognition (ASR) systems. This issue matters even more in Transformer-based ASR, because the self-attention mechanism in Transformers has O(n^2) complexity in both training and inference. In this paper, we propose a Transformer-based ASR model with a time-reduction layer: in addition to the traditional sub-sampling applied to the input features, we incorporate time-reduction layers inside the Transformer encoder layers to further reduce the frame rate. This lowers the computational cost of self-attention during training and inference while improving performance. Moreover, we introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD), which further improves the performance of our ASR model. Experiments on the LibriSpeech datasets show that our proposed methods outperform all other Transformer-based ASR systems. Furthermore, with language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models with just 30 million parameters trained without any external data.
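The two ideas the abstract describes can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's implementation: the time-reduction step is shown as simple frame concatenation (the paper's exact layer placement inside the encoder differs), and the S-KD objective is shown with a hypothetical mixing weight `lam` rather than the paper's setting.

```python
import numpy as np

def time_reduce(frames, r=2):
    # Concatenate every r consecutive frames along the feature axis,
    # shortening the sequence by a factor of r. Since self-attention
    # cost is quadratic in length, it drops by roughly r**2.
    T, d = frames.shape
    T_r = T // r                       # drop any trailing remainder frames
    return frames[: T_r * r].reshape(T_r, r * d)

def skd_loss(student_logits, labels, teacher_probs, lam=0.5):
    # Self-knowledge-distillation objective (illustrative): a weighted sum
    # of the usual cross-entropy on the labels and a cross-entropy against
    # soft targets produced by the model itself (e.g. an earlier snapshot).
    # `lam` is a hypothetical mixing weight, not the paper's value.
    logp = student_logits - np.log(
        np.sum(np.exp(student_logits), axis=-1, keepdims=True))
    ce = -np.mean(logp[np.arange(len(labels)), labels])
    kd = -np.mean(np.sum(teacher_probs * logp, axis=-1))
    return (1 - lam) * ce + lam * kd

feats = np.random.randn(100, 80)       # 100 frames of 80-dim filterbanks
print(time_reduce(feats, r=2).shape)   # (50, 160)
```

Halving the frame rate with `r=2` cuts the per-layer attention cost to roughly a quarter, which is the motivation for applying such layers inside the encoder on top of the usual input sub-sampling.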
Pages: 2102-2106
Page count: 5