Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection

Cited by: 10

Authors:

Higuchi, Takuya [1]

Ghasemzadeh, Mohammad [1]

You, Kisun [1]

Dhir, Chandra [1]

Affiliations:

[1] Apple, Cupertino, CA 95014 USA

Source:

INTERSPEECH 2020 | 2020

Keywords:

small footprint voice trigger detection; singular value decomposition filter; convolutional neural network;

DOI:

10.21437/Interspeech.2020-2763

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification codes:

100104 ; 100213 ;

Abstract:

We propose a stacked 1D convolutional neural network (S1DCNN) for end-to-end small-footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application that lets users activate their devices by simply saying a keyword or phrase. For privacy and latency reasons, a voice trigger detection system should run on an always-on processor on the device, so a small memory footprint and low compute cost are crucial. Recently, singular value decomposition filters (SVDFs) have been used for end-to-end voice trigger detection. An SVDF approximates a fully-connected layer with a low-rank approximation, which reduces the number of model parameters. In this work, we propose the S1DCNN as an alternative approach for end-to-end small-footprint voice trigger detection. An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. We show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieves a 19.0% relative reduction in false reject ratio (FRR) with a similar model size and a similar time delay compared to the SVDF. With longer time delays, the S1DCNN further improves the FRR by up to 12.2% relative.
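
The layer structure described in the abstract (a 1D convolution followed by a depth-wise 1D convolution) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, kernel sizes, and the rank-1 SVDF formulation used for comparison are all assumptions made for illustration, and biases and nonlinearities are omitted. Under these assumptions, setting the first convolution's kernel size to 1 reproduces the SVDF output.

```python
import numpy as np

def conv1d(x, w):
    """Valid 1D convolution over time (cross-correlation form).
    x: (T, D_in) input frames; w: (K, D_in, C) kernel. Returns (T-K+1, C)."""
    K = w.shape[0]
    t_out = x.shape[0] - K + 1
    return np.stack([np.einsum("kd,kdc->c", x[t:t + K], w) for t in range(t_out)])

def depthwise_conv1d(h, v):
    """Depth-wise 1D convolution: each channel is filtered independently.
    h: (T, C); v: (L, C) one length-L time filter per channel. Returns (T-L+1, C)."""
    L = v.shape[0]
    t_out = h.shape[0] - L + 1
    return np.stack([(h[t:t + L] * v).sum(axis=0) for t in range(t_out)])

def s1dcnn_layer(x, w, v):
    """Hypothetical S1DCNN layer: 1D convolution, then depth-wise 1D convolution."""
    return depthwise_conv1d(conv1d(x, w), v)

def svdf_layer(x, alpha, beta):
    """Assumed rank-1 SVDF: node c projects each frame with alpha[:, c],
    then filters the projections over L frames with beta[:, c]."""
    L = beta.shape[0]
    proj = x @ alpha                      # (T, C) per-frame projections
    t_out = x.shape[0] - L + 1
    out = np.zeros((t_out, alpha.shape[1]))
    for t in range(t_out):
        for l in range(L):
            out[t] += beta[l] * proj[t + l]
    return out

# Illustrative sizes only (not taken from the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal((20, 8))          # 20 frames, 8 features
alpha = rng.standard_normal((8, 4))       # feature filters for 4 nodes
beta = rng.standard_normal((5, 4))        # time filters of length 5

# With kernel size K = 1, the first convolution is a per-frame projection,
# so the stacked layer matches the SVDF sketch above.
y_svdf = svdf_layer(x, alpha, beta)
y_s1d = s1dcnn_layer(x, alpha[None, :, :], beta)
same = np.allclose(y_svdf, y_s1d)
```

In this sketch, a first-convolution kernel size greater than 1 gives the layer extra temporal context per channel that the SVDF form does not have, at the cost of K times more parameters in the first convolution.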

Citation

Pages: 2592-2596

Number of pages: 5

