On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation

Cited by: 0
Authors
Yang, Gene-Ping [1]
Gu, Yue [2]
Tang, Qingming [2]
Du, Dongsu [2]
Liu, Yuzong [3]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
[2] Amazon, Alexa Perceptual Technol, Seattle, WA USA
[3] Zoom Video Commun Inc, San Jose, CA USA
Source
INTERSPEECH 2023 | 2023
Keywords
self-supervised learning; knowledge distillation; dual-view cross-correlation; keyword spotting; on-device;
DOI
10.21437/Interspeech.2023-2362
Chinese Library Classification
O42 [Acoustics];
Discipline Classification Code
070206; 082403;
Abstract
Large self-supervised models are effective feature extractors, but their application is challenging under on-device budget constraints and biased dataset collection, especially in keyword spotting. To address this, we proposed a knowledge distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. Our approach used a teacher-student framework to transfer knowledge from a larger, more complex model to a smaller, light-weight model using dual-view cross-correlation distillation and the teacher's codebook as learning objectives. We evaluated our model's performance on an Alexa keyword spotting detection task using a 16.6k-hour in-house dataset. Our technique showed exceptional performance in normal and noisy conditions, demonstrating the efficacy of knowledge distillation methods in constructing self-supervised models for keyword spotting tasks while working within on-device resource constraints.
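The abstract names dual-view cross-correlation distillation as one of the teacher-student learning objectives. Below is a minimal, hypothetical PyTorch sketch of what a Barlow Twins-style cross-correlation distillation loss between student and teacher embeddings could look like; the function name, the assumption of matching embedding dimensions, and the off-diagonal weight lam are illustrative choices, not details taken from the paper.

    # Hypothetical sketch of a cross-correlation distillation loss between
    # student and teacher embeddings (Barlow Twins-style objective).
    # Shapes, names, and the weight `lam` are illustrative assumptions.
    import torch

    def cross_correlation_distillation_loss(student_emb, teacher_emb, lam=0.005):
        """student_emb, teacher_emb: (batch, dim) embeddings of the same utterances."""
        # Standardize each embedding dimension across the batch.
        s = (student_emb - student_emb.mean(0)) / (student_emb.std(0) + 1e-6)
        t = (teacher_emb - teacher_emb.mean(0)) / (teacher_emb.std(0) + 1e-6)

        n, _ = s.shape
        # Cross-correlation matrix between student and teacher dimensions.
        c = (s.T @ t) / n                                   # (dim, dim)

        on_diag = (torch.diagonal(c) - 1).pow(2).sum()      # align matched dimensions
        off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
        return on_diag + lam * off_diag

    # Example usage with random embeddings:
    # loss = cross_correlation_distillation_loss(torch.randn(32, 256), torch.randn(32, 256))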
Pages: 1623-1627
Number of pages: 5