FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Cited by: 14
Authors
Lee, Yeonghyeon [1 ]
Jang, Kangwook [1 ]
Goo, Jahyun [1 ]
Jung, Youngmoon [1 ]
Kim, Hoirin [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon, South Korea
Source
INTERSPEECH 2022 | 2022
Funding
National Research Foundation, Singapore;
Keywords
knowledge distillation; speech representation learning; self-supervised learning; model compression;
DOI
10.21437/Interspeech.2022-11112
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Large-scale speech self-supervised learning (SSL) has emerged as a major field of speech processing; however, the computational cost arising from its vast model size creates a high entry barrier for academia. In addition, existing distillation techniques for speech SSL models compress the model by reducing the number of layers, which induces performance degradation on linguistic pattern recognition tasks such as phoneme recognition (PR). In this paper, we propose FitHuBERT, which is thinner in dimension across almost all model components and deeper in layers compared to prior speech SSL distillation work. Moreover, we employ a time-reduction layer to speed up inference and propose a hint-based distillation method to mitigate performance degradation. Our method reduces the model to 23.8% of HuBERT's size and 35.9% of its inference time. We also achieve a 12.1% word error rate and a 13.3% phoneme error rate on the SUPERB benchmark, which is superior to prior work.
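The abstract names two components: a time-reduction layer that shortens the frame sequence before the Transformer layers, and a hint-based (FitNets-style) loss that matches the thin student's hidden states to the wide teacher's. The following is a minimal PyTorch sketch of how such components could look; it is not the authors' implementation, and the hidden sizes (768 for the teacher, 480 for the student) and the concatenation-based reduction are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeReduction(nn.Module):
    """Concatenate every `stride` consecutive frames, shrinking T by `stride`
    (and growing the feature dimension by the same factor)."""
    def __init__(self, stride: int = 2):
        super().__init__()
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        B, T, D = x.shape
        T = T - (T % self.stride)  # drop trailing frames that do not fit
        return x[:, :T].reshape(B, T // self.stride, D * self.stride)


class HintLoss(nn.Module):
    """L2 hint loss between a projected student layer and a teacher layer."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)  # lift thin student dim

    def forward(self, h_student: torch.Tensor, h_teacher: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(self.proj(h_student), h_teacher)


# Toy usage with assumed sizes: a thin 480-dim student against a 768-dim teacher.
student_hidden = torch.randn(2, 100, 480)
reduced = TimeReduction(stride=2)(student_hidden)            # (2, 50, 960)
loss = HintLoss(480, 768)(student_hidden, torch.randn(2, 100, 768))
```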
Pages: 3588-3592
Number of pages: 5