Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation

Cited by: 3
Authors
Haidar, Md Akmal [1 ]
Xing, Chao [1 ]
Rezagholizadeh, Mehdi [1 ]
Affiliations
[1] Huawei Noah's Ark Lab, Montreal Res Ctr, Montreal, PQ, Canada
Source
INTERSPEECH 2021 | 2021
Keywords
speech recognition; transformer; sub-sampling; self-knowledge distillation;
DOI
10.21437/Interspeech.2021-1743
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
Reducing the length of the input speech-feature sequence through sub-sampling eases the alignment between speech features and the text transcript and is an important way to obtain better results in end-to-end (E2E) automatic speech recognition (ASR) systems. This issue matters even more in Transformer-based ASR, because the self-attention mechanism in Transformers has O(n^2) complexity in both training and inference. In this paper, we propose a Transformer-based ASR model with a time-reduction layer: in addition to the traditional sub-sampling applied to the input features, we incorporate time-reduction layers inside the Transformer encoder layers to further reduce the frame rate. This lowers the computational cost of self-attention during training and inference while improving performance. Moreover, we introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD), which further improves the performance of our ASR model. Experiments on the LibriSpeech datasets show that our proposed methods outperform all other Transformer-based ASR systems. Furthermore, with language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models with just 30 million parameters trained without any external data.
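The two ideas the abstract describes can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's implementation: the time-reduction step is shown as simple frame concatenation (the paper's exact layer placement inside the encoder differs), and the S-KD objective is shown with a hypothetical mixing weight `lam` rather than the paper's setting.

```python
import numpy as np

def time_reduce(frames, r=2):
    # Concatenate every r consecutive frames along the feature axis,
    # shortening the sequence by a factor of r. Since self-attention
    # cost is quadratic in length, it drops by roughly r**2.
    T, d = frames.shape
    T_r = T // r                       # drop any trailing remainder frames
    return frames[: T_r * r].reshape(T_r, r * d)

def skd_loss(student_logits, labels, teacher_probs, lam=0.5):
    # Self-knowledge-distillation objective (illustrative): a weighted sum
    # of the usual cross-entropy on the labels and a cross-entropy against
    # soft targets produced by the model itself (e.g. an earlier snapshot).
    # `lam` is a hypothetical mixing weight, not the paper's value.
    logp = student_logits - np.log(
        np.sum(np.exp(student_logits), axis=-1, keepdims=True))
    ce = -np.mean(logp[np.arange(len(labels)), labels])
    kd = -np.mean(np.sum(teacher_probs * logp, axis=-1))
    return (1 - lam) * ce + lam * kd

feats = np.random.randn(100, 80)       # 100 frames of 80-dim filterbanks
print(time_reduce(feats, r=2).shape)   # (50, 160)
```

Halving the frame rate with `r=2` cuts the per-layer attention cost to roughly a quarter, which is the motivation for applying such layers inside the encoder on top of the usual input sub-sampling.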
Pages: 2102-2106
Page count: 5