A Novel Loss Function and Training Strategy for Noise-Robust Keyword Spotting

Cited by: 18

Authors:

Lopez-Espejo, Ivan [1]

Tan, Zheng-Hua [1]

Jensen, Jesper [1,2]

Affiliations:

[1] Aalborg Univ, Dept Elect Syst, DK-9220 Aalborg, Denmark

[2] Oticon AS, DK-2765 Smorum, Denmark

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2021 / Vol. 29

Keywords:

Keyword spotting; noise robustness; multi-condition training; deep metric learning; loss function; keyword embedding; RECOGNITION;

DOI:

10.1109/TASLP.2021.3092567

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

The development of keyword spotting (KWS) systems that are accurate in noisy conditions remains a challenge. Towards this goal, in this paper we propose a novel training strategy relying on multi-condition training for noise-robust KWS. Under this strategy, we view state-of-the-art KWS models as the composition of a keyword embedding extractor and a linear classifier that are trained successively. To train the keyword embedding extractor, we also propose a new (C_{N,2}+1)-pair loss function, which extends the concept behind related loss functions such as the triplet and N-pair losses to reach larger inter-class and smaller intra-class variation. Experimental results on a noisy version of the Google Speech Commands Dataset show that our proposal achieves a relative improvement of around 12% in KWS accuracy with respect to standard end-to-end multi-condition training when speech is distorted by unseen noises. This performance improvement is achieved without increasing the computational complexity of the KWS model.
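For orientation, the two related losses named in the abstract can be sketched as below. This is a minimal illustration of the standard triplet and N-pair formulations only; it is not the loss proposed in the paper, whose definition is not given in this record. The function names, the margin value, and the toy 2-D embeddings are assumptions made for the example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on squared distances: one positive and one negative per anchor."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

def n_pair_loss(anchor, positive, negatives):
    """log(1 + sum_i exp(a.n_i - a.p)): one positive, several negatives."""
    pos = dot(anchor, positive)
    return math.log1p(sum(math.exp(dot(anchor, n) - pos) for n in negatives))

# Toy 2-D embeddings (hypothetical values, for illustration only).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# The same-class embedding sits close to the anchor: low loss.
well_separated = n_pair_loss(anchor, positive, negatives)
# Swapping a far embedding into the positive slot: higher loss.
confused = n_pair_loss(anchor, negatives[0], [positive, negatives[1]])
```

Both losses shrink as same-class embeddings move closer to the anchor than other-class embeddings do. The triplet loss compares against a single negative at a time, whereas the N-pair loss compares against several negatives at once.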


Pages: 2254-2266

Page count: 13

