A Novel Loss Function and Training Strategy for Noise-Robust Keyword Spotting

Cited by: 18
Authors
Lopez-Espejo, Ivan [1]
Tan, Zheng-Hua [1]
Jensen, Jesper [1, 2]
Affiliations
[1] Aalborg Univ, Dept Elect Syst, DK-9220 Aalborg, Denmark
[2] Oticon AS, DK-2765 Smorum, Denmark
Keywords
Keyword spotting; noise robustness; multi-condition training; deep metric learning; loss function; keyword embedding; RECOGNITION
DOI
10.1109/TASLP.2021.3092567
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
The development of keyword spotting (KWS) systems that are accurate in noisy conditions remains a challenge. Towards this goal, in this paper we propose a novel training strategy relying on multi-condition training for noise-robust KWS. Under this strategy, we view state-of-the-art KWS models as the composition of a keyword embedding extractor and a linear classifier that are trained successively. To train the keyword embedding extractor, we also propose a new (C_{N,2}+1)-pair loss function, which extends the concept behind related loss functions such as the triplet and N-pair losses to reach larger inter-class and smaller intra-class variation. Experimental results on a noisy version of the Google Speech Commands Dataset show that our proposal achieves a relative improvement in KWS accuracy of around 12% with respect to standard end-to-end multi-condition training when speech is distorted by unseen noises. This performance improvement is achieved without increasing the computational complexity of the KWS model.
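The abstract names the triplet and N-pair losses as the related losses that the proposed one extends. As a rough illustration of those two baselines only, here is a minimal NumPy sketch. It is not the paper's (C_{N,2}+1)-pair loss, whose definition this record does not give. The function names, the squared-Euclidean distance, the dot-product similarity, and the default margin are assumptions chosen for illustration.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on squared Euclidean distances (illustrative)."""
    d_pos = np.sum((anchor - positive) ** 2)  # anchor to same-class embedding
    d_neg = np.sum((anchor - negative) ** 2)  # anchor to other-class embedding
    # Zero once the negative is farther than the positive by at least `margin`.
    return float(max(0.0, d_pos - d_neg + margin))


def n_pair_loss(anchor, positive, negatives):
    """N-pair-style loss: one positive contrasted against several negatives."""
    pos_sim = anchor @ positive      # similarity to the positive (scalar)
    neg_sims = negatives @ anchor    # similarities to each negative, shape (K,)
    # log(1 + sum_k exp(s_neg_k - s_pos)); small when the positive dominates.
    return float(np.log1p(np.sum(np.exp(neg_sims - pos_sim))))


# Toy 2-D embeddings (hypothetical values, not taken from the paper).
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.0])
n = np.array([0.0, 1.0])
print(triplet_loss(a, p, n))
print(n_pair_loss(a, p, np.stack([n])))
```

Both losses reward embeddings in which same-keyword examples sit closer together than different-keyword examples, which is the "larger inter-class and smaller intra-class variation" goal stated in the abstract.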
Pages: 2254-2266
Number of pages: 13