Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition

Cited by: 1
Authors
Gong, Xun [1]
Zhou, Zhikai [1]
Qian, Yanmin [1]
Affiliations
[1] Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science & Engineering, Shanghai, China
Source
INTERSPEECH 2022 | 2022
Keywords
knowledge transfer; knowledge distillation; non-autoregressive; end-to-end; speech recognition
DOI
10.21437/Interspeech.2022-632
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Modern non-autoregressive (NAR) speech recognition systems aim to accelerate inference; however, they suffer from performance degradation compared with autoregressive (AR) models, and large model sizes remain a problem. We propose a novel knowledge transfer and distillation architecture that leverages knowledge from AR models to improve NAR performance while reducing model size. Frame-level and sequence-level objectives are designed for transfer learning. To further boost NAR performance, a beam search method on Mask-CTC is developed to enlarge the search space during inference. Experiments show that the proposed NAR beam search yields a relative CER reduction of over 5% on the AISHELL-1 benchmark with a tolerable real-time-factor (RTF) increase. With knowledge transfer, an NAR student of the same size as the AR teacher obtains relative CER reductions of 8%/16% on the AISHELL-1 dev/test sets, and over 25% relative WER reductions on the LibriSpeech test-clean/test-other sets. Moreover, ~9x smaller NAR models achieve ~25% relative CER/WER reductions on both the AISHELL-1 and LibriSpeech benchmarks with the proposed knowledge transfer and distillation.
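The record does not detail the distillation objectives themselves; purely as illustration, the following is a minimal PyTorch sketch of a frame-level distillation loss of the kind the abstract describes: a temperature-scaled KL divergence pulling the NAR student's per-frame token posteriors toward the AR teacher's. The function name, signature, and temperature handling are hypothetical and not taken from the paper.

import torch
import torch.nn.functional as F

def frame_level_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 2.0) -> torch.Tensor:
    # student_logits, teacher_logits: (batch, time, vocab).
    # Hypothetical sketch; the paper's exact frame-level objective may differ.
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)     # soft targets from the AR teacher
    student_logp = F.log_softmax(student_logits / t, dim=-1)  # NAR student log-posteriors
    # KL(teacher || student) per frame, averaged over batch and time;
    # the t^2 factor keeps gradient magnitudes comparable across temperatures.
    kl = F.kl_div(student_logp, teacher_probs, reduction="none").sum(dim=-1)
    return (t * t) * kl.mean()

A sequence-level variant would instead match scores over teacher-generated hypotheses (e.g., N-best lists) rather than per-frame posteriors; the paper combines both levels of supervision.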
Pages: 2618-2622 (5 pages)