End-to-End Speech Keyword Spotting Training Method Based on Sample's Class Uncertainty

Cited by: 0
Authors
He, Qian-Hua [1 ]
Chen, Yong-Qiang [1 ]
Zheng, Ruo-Wei [1 ]
Huang, Jin-Xin [1 ]
Affiliations
[1] School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong, China
Source
Tien Tzu Hsueh Pao/Acta Electronica Sinica | 2024, Vol. 52, No. 10
Funding
National Natural Science Foundation of China
Keywords
class uncertainty sampling; deep learning; end-to-end; speech keyword spotting;
DOI
10.12263/DZXB.20240048
Abstract
End-to-end deep learning is the main technology for speech keyword spotting. Research has focused on exploring better network structures, modeling units, and search strategies, and has made considerable progress. However, less attention has been paid to training efficiency. In this paper, a novel class uncertainty sampling (CUS) strategy is proposed to select effective samples for each training epoch. Since only a subset of the data is used, much training time is saved. The core idea of CUS is to measure each sample's class uncertainty from the forward information of the output layer during the middle and late training stages, and to select samples with a probability determined by their class uncertainty. More attention is therefore paid to samples near the decision boundary, which are prone to missed detections or false alarms. Furthermore, the proposed method can shield against interference from mislabeled samples. Experimental results on the AISHELL-1 Mandarin dataset showed that fast convergence and better training performance were achieved. Compared with the conventional training strategy, the average training time and the average convergence time were relatively shortened by 60% and 47.5%, respectively. At a false accept rate (FAR) of 0.5 FP/h, the false reject rate (FRR) was reduced from 4.75% to 3.65%, a relative reduction of 30.1%, and the maximum term weighted value (MTWV) was increased from 0.8374 to 0.8531. Moreover, it was experimentally verified that the method can shield most of the mislabeled samples. These conclusions were further confirmed by experiments on the large-scale AISHELL-2 Mandarin dataset. © 2024 Chinese Institute of Electronics. All rights reserved.
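The abstract describes the CUS strategy only at a high level: class uncertainty is computed from the output layer's forward information, and samples are drawn with a probability tied to that uncertainty. Below is a minimal Python sketch of such a selection step, assuming an entropy-based uncertainty measure over the output-layer posteriors and sampling with probability proportional to uncertainty; the exact measure, normalization, and subset ratio used in the paper are not given in the abstract, so these details are illustrative assumptions.

```python
import numpy as np

def class_uncertainty(posteriors: np.ndarray) -> np.ndarray:
    """Entropy of output-layer posteriors as a proxy for class uncertainty.

    posteriors: (num_samples, num_classes) softmax outputs collected from the
    forward pass of a previous epoch. (The paper's exact uncertainty measure is
    not specified in the abstract; entropy is one common choice.)
    """
    eps = 1e-12
    return -np.sum(posteriors * np.log(posteriors + eps), axis=1)

def cus_select(posteriors: np.ndarray, subset_ratio: float, rng=None) -> np.ndarray:
    """Select a per-epoch training subset with probability proportional to uncertainty.

    Samples near the decision boundary (high uncertainty) are favored; a small
    floor keeps every sample selectable. Returns indices of the chosen subset.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = class_uncertainty(posteriors)
    p = (u + 1e-6) / np.sum(u + 1e-6)       # normalize into a sampling distribution
    k = max(1, int(subset_ratio * len(u)))   # size of the per-epoch subset
    return rng.choice(len(u), size=k, replace=False, p=p)

# Example usage (subset_ratio of 0.4 is an assumption, not a value from the paper):
# posteriors = collect_output_layer_posteriors(model, dataset)  # hypothetical helper
# subset_idx = cus_select(posteriors, subset_ratio=0.4)
```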
Pages: 3482-3492
Page count: 10