FOCAL LOSS AND DOUBLE-EDGE-TRIGGERED DETECTOR FOR ROBUST SMALL-FOOTPRINT KEYWORD SPOTTING

Cited by: 0

Authors:

Liu, Bin [1,2]

Nie, Shuai [1]

Zhang, Yaping [1,2]

Liang, Shan [1]

Yang, Zhanlei [1]

Liu, Wenju [1]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China

Source:

2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019

Keywords:

keyword spotting; focal loss; double-edge-triggered detecting method; speech recognition

DOI:

Not available

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

A keyword spotting (KWS) system, which detects a specific keyword in a continuous stream of audio, is a critical component of human-computer interfaces. The goal of KWS is to provide high detection accuracy at a low false alarm rate while keeping memory and computation requirements small. A DNN-based KWS system faces a large class imbalance during training because far less data is usually available for the keyword than for background speech; the background data overwhelms training and leads to a degenerate model. In this paper, we explore the focal loss for training a small-footprint KWS system. It automatically down-weights the contribution of easy samples during training and focuses the model on hard samples, which naturally addresses the class imbalance and allows us to use all available data efficiently. Furthermore, many keywords of Chinese conversational assistants are repeated words owing to idiomatic usage, such as 'XIAO DU XIAO DU'. We propose a double-edge-triggered detecting method for repeated keywords, which significantly reduces the false alarm rate relative to the single-threshold method. Systematic experiments demonstrate significant further improvements compared to the baseline system.
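
The down-weighting of easy samples described in the abstract can be illustrated with the focal loss as defined by Lin et al. (2017), FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The sketch below is a minimal per-frame binary version and is not taken from the paper: the function names, the binary keyword/background setup, and the defaults gamma = 2 and alpha = 0.25 (the values suggested by Lin et al. for object detection) are assumptions made for illustration only.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one frame.

    p: predicted probability that the frame belongs to the keyword.
    y: 1 for a keyword frame, 0 for a background frame.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss, FL = -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are illustrative defaults (assumption), not values
    reported for the KWS system described in the abstract.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)**gamma shrinks toward 0 for well-classified (easy) frames.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy_background = (0.01, 0)  # background frame, confidently rejected
    hard_background = (0.60, 0)  # background frame mistaken for the keyword
    for name, (p, y) in [("easy", easy_background), ("hard", hard_background)]:
        ratio = focal_loss(p, y) / cross_entropy(p, y)
        print(f"{name}: focal/CE ratio = {ratio:.6f}")
```

With gamma = 0 and alpha = 0.5 the expression reduces to half the cross-entropy, so the modulating factor (1 - p_t)^gamma is the only thing that separates easy frames from hard ones. The many confidently rejected background frames then contribute little to the total loss, which is the mechanism the abstract relies on to handle class imbalance.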


Pages: 6361-6365

Page count: 5

