End-to-End Speech Keyword Spotting Training Method Based on Sample's Class Uncertainty

Cited by: 0
Authors
He, Qian-Hua [1 ]
Chen, Yong-Qiang [1 ]
Zheng, Ruo-Wei [1 ]
Huang, Jin-Xin [1 ]
Affiliations
[1] School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong, China
Source
Tien Tzu Hsueh Pao/Acta Electronica Sinica | 2024, Vol. 52, No. 10
Funding
National Natural Science Foundation of China
Keywords
class uncertainty sampling; deep learning; end-to-end; speech keyword spotting;
DOI
10.12263/DZXB.20240048
Abstract
End-to-end deep learning is the main technology for speech keyword spotting. Research has focused on exploring better network structures, modeling units, and search strategies, and has made considerable progress. However, less attention has been paid to training efficiency. In this paper, a novel class uncertainty sampling (CUS) strategy is proposed to select effective samples for each training epoch. Since only a subset of the data is used, much training time is saved. The core idea of CUS is to measure each sample's class uncertainty from the forward information of the output layer during the middle and late training stages, and to select samples with a probability determined by their class uncertainty. More attention is therefore paid to samples near the decision boundary, which are prone to missed detections or false alarms. Furthermore, the proposed method can shield against interference from mislabeled samples. Experimental results on the AISHELL-1 Mandarin dataset showed that fast convergence and better training performance were achieved. Compared with the conventional training strategy, the average training time and the average convergence time were relatively shortened by 60% and 47.5%, respectively. At a false accept rate (FAR) of 0.5 FP/h, the false reject rate (FRR) was reduced from 4.75% to 3.65%, a relative reduction of 30.1%, and the maximum term weighted value (MTWV) was increased from 0.8374 to 0.8531. Moreover, it was experimentally verified that the method can shield most of the mislabeled samples. These conclusions were further confirmed by experiments on the large-scale AISHELL-2 Mandarin dataset. © 2024 Chinese Institute of Electronics. All rights reserved.
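The abstract describes the CUS strategy only at a high level: class uncertainty is computed from the output layer's forward information, and samples are drawn with a probability tied to that uncertainty. Below is a minimal Python sketch of such a selection step, assuming an entropy-based uncertainty measure over the output-layer posteriors and sampling with probability proportional to uncertainty; the exact measure, normalization, and subset ratio used in the paper are not given in the abstract, so these details are illustrative assumptions.

```python
import numpy as np

def class_uncertainty(posteriors: np.ndarray) -> np.ndarray:
    """Entropy of output-layer posteriors as a proxy for class uncertainty.

    posteriors: (num_samples, num_classes) softmax outputs collected from the
    forward pass of a previous epoch. (The paper's exact uncertainty measure is
    not specified in the abstract; entropy is one common choice.)
    """
    eps = 1e-12
    return -np.sum(posteriors * np.log(posteriors + eps), axis=1)

def cus_select(posteriors: np.ndarray, subset_ratio: float, rng=None) -> np.ndarray:
    """Select a per-epoch training subset with probability proportional to uncertainty.

    Samples near the decision boundary (high uncertainty) are favored; a small
    floor keeps every sample selectable. Returns indices of the chosen subset.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = class_uncertainty(posteriors)
    p = (u + 1e-6) / np.sum(u + 1e-6)       # normalize into a sampling distribution
    k = max(1, int(subset_ratio * len(u)))   # size of the per-epoch subset
    return rng.choice(len(u), size=k, replace=False, p=p)

# Example usage (subset_ratio of 0.4 is an assumption, not a value from the paper):
# posteriors = collect_output_layer_posteriors(model, dataset)  # hypothetical helper
# subset_idx = cus_select(posteriors, subset_ratio=0.4)
```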
Pages: 3482-3492
Page count: 10