FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Cited by: 14
Authors
Lee, Yeonghyeon [1 ]
Jang, Kangwook [1 ]
Goo, Jahyun [1 ]
Jung, Youngmoon [1 ]
Kim, Hoirin [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon, South Korea
Source
INTERSPEECH 2022 | 2022
Funding
National Research Foundation, Singapore;
Keywords
knowledge distillation; speech representation learning; self-supervised learning; model compression;
DOI
10.21437/Interspeech.2022-11112
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Large-scale speech self-supervised learning (SSL) has emerged as a major field of speech processing; however, the computational cost arising from its vast model size creates a high entry barrier for academia. In addition, existing distillation techniques for speech SSL models compress the model by reducing the number of layers, which induces performance degradation on linguistic pattern recognition tasks such as phoneme recognition (PR). In this paper, we propose FitHuBERT, which is thinner in dimension across almost all model components and deeper in layers compared to prior speech SSL distillation work. Moreover, we employ a time-reduction layer to speed up inference and propose a hint-based distillation method to mitigate performance degradation. Our method reduces the model to 23.8% of HuBERT's size and 35.9% of its inference time. We also achieve a 12.1% word error rate and a 13.3% phoneme error rate on the SUPERB benchmark, which is superior to prior work.
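The abstract names two components: a time-reduction layer that shortens the frame sequence before the Transformer layers, and a hint-based (FitNets-style) loss that matches the thin student's hidden states to the wide teacher's. The following is a minimal PyTorch sketch of how such components could look; it is not the authors' implementation, and the hidden sizes (768 for the teacher, 480 for the student) and the concatenation-based reduction are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeReduction(nn.Module):
    """Concatenate every `stride` consecutive frames, shrinking T by `stride`
    (and growing the feature dimension by the same factor)."""
    def __init__(self, stride: int = 2):
        super().__init__()
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        B, T, D = x.shape
        T = T - (T % self.stride)  # drop trailing frames that do not fit
        return x[:, :T].reshape(B, T // self.stride, D * self.stride)


class HintLoss(nn.Module):
    """L2 hint loss between a projected student layer and a teacher layer."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)  # lift thin student dim

    def forward(self, h_student: torch.Tensor, h_teacher: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(self.proj(h_student), h_teacher)


# Toy usage with assumed sizes: a thin 480-dim student against a 768-dim teacher.
student_hidden = torch.randn(2, 100, 480)
reduced = TimeReduction(stride=2)(student_hidden)            # (2, 50, 960)
loss = HintLoss(480, 768)(student_hidden, torch.randn(2, 100, 768))
```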
Pages: 3588-3592
Number of pages: 5