Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection

Cited by: 10
Authors
Higuchi, Takuya [1 ]
Ghasemzadeh, Mohammad [1 ]
You, Kisun [1 ]
Dhir, Chandra [1 ]
Affiliations
[1] Apple, Cupertino, CA 95014 USA
Source
INTERSPEECH 2020 | 2020
Keywords
small footprint voice trigger detection; singular value decomposition filter; convolutional neural network
DOI
10.21437/Interspeech.2020-2763
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
We propose a stacked 1D convolutional neural network (S1DCNN) for end-to-end small footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application, with which users can activate their devices by simply saying a keyword or phrase. For privacy and latency reasons, a voice trigger detection system should run on an always-on processor on the device. Therefore, a small memory footprint and low compute cost are crucial for a voice trigger detection system. Recently, singular value decomposition filters (SVDFs) have been used for end-to-end voice trigger detection. SVDFs approximate a fully-connected layer with a low-rank approximation, which reduces the number of model parameters. In this work, we propose the S1DCNN as an alternative approach for end-to-end small footprint voice trigger detection. An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. We show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieves a 19.0% relative false reject ratio (FRR) reduction with a similar model size and a similar time delay compared to the SVDF. By using longer time delays, the S1DCNN further improves the FRR by up to 12.2% relative.
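The abstract's description of an S1DCNN layer (a 1D convolution followed by a depth-wise 1D convolution) can be sketched in code. The following is a minimal PyTorch illustration under stated assumptions: the class name S1DCNNLayer, the kernel sizes, channel counts, and the ReLU nonlinearity are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of one S1DCNN layer as described in the abstract:
# a 1D convolution followed by a depth-wise 1D convolution over time.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class S1DCNNLayer(nn.Module):
    def __init__(self, in_channels, out_channels, conv_kernel=1, dw_kernel=8):
        super().__init__()
        # Plain 1D convolution mixing the input channels.
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size=conv_kernel)
        # Depth-wise 1D convolution: groups == channels gives each channel
        # its own temporal filter, keeping the parameter count small.
        self.depthwise = nn.Conv1d(out_channels, out_channels,
                                   kernel_size=dw_kernel, groups=out_channels)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, time)
        return self.act(self.depthwise(self.conv(x)))

# Example: a stack of two layers over 40-dimensional frame features.
layers = nn.Sequential(S1DCNNLayer(40, 64), S1DCNNLayer(64, 64))
y = layers(torch.randn(1, 40, 100))  # -> (1, 64, remaining time steps)
```

With conv_kernel=1 the first stage operates frame by frame, so the layer roughly mirrors the per-frame feature filter followed by a per-node time filter used in an SVDF, which appears to be the sense in which the abstract calls the SVDF a special case of the S1DCNN layer.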
Pages: 2592-2596
Number of pages: 5