Temporal Convolution for Real-time Keyword Spotting on Mobile Devices

Cited by: 91
Authors
Choi, Seungwoo [1]
Seo, Seokjun [1]
Shin, Beomjun [1]
Byun, Hyeongmin [1]
Kersner, Martin [1]
Kim, Beomsu [1]
Kim, Dongyoung [1]
Ha, Sungjoo [1]
Affiliations
[1] Hyperconnect, Seoul, South Korea
Source
INTERSPEECH 2019 | 2019
Keywords
keyword spotting; real-time; convolutional neural network; temporal convolution; mobile device
DOI
10.21437/Interspeech.2019-1363
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
Keyword spotting (KWS) plays a critical role in enabling speech-based user interactions on smart devices. Recent developments in the field of deep learning have led to wide adoption of convolutional neural networks (CNNs) in KWS systems due to their exceptional accuracy and robustness. The main challenge faced by KWS systems is the trade-off between high accuracy and low latency. Unfortunately, there has been little quantitative analysis of the actual latency of KWS models on mobile devices. This is especially concerning since conventional convolution-based KWS approaches are known to require a large number of operations to attain an adequate level of performance. In this paper, we propose a temporal convolution for real-time KWS on mobile devices. Unlike most 2D convolution-based KWS approaches, which require a deep architecture to fully capture both low- and high-frequency domains, we exploit temporal convolutions with a compact ResNet architecture. On the Google Speech Command Dataset, we achieve a speedup of more than 385x on Google Pixel 1 and surpass the accuracy of the state-of-the-art model. In addition, we release the implementation of the proposed and baseline models, including an end-to-end pipeline for training models and evaluating them on mobile devices.
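The operation-count argument in the abstract can be illustrated with a rough sketch. The abstract does not give layer shapes, so everything below is an assumption made for illustration: the input size (98 frames by 40 frequency bins), the channel count of 64, the kernel sizes, and the helper names `conv2d_macs` and `temporal_conv_macs` are all hypothetical. The sketch also assumes that a temporal convolution treats the frequency bins as input channels and slides only along the time axis, and that both layers use stride 1 with "same" padding. It is not the authors' implementation.

```python
def conv2d_macs(t, f, c_in, c_out, k_t, k_f):
    """Multiply-accumulates for one 2D convolution layer.

    Assumes stride 1 and "same" padding, so there is one output
    value per (time, frequency) position and output channel.
    """
    return t * f * k_t * k_f * c_in * c_out


def temporal_conv_macs(t, c_in, c_out, k_t):
    """Multiply-accumulates for one 1D convolution along the time axis.

    Assumes stride 1 and "same" padding. For the first layer, the
    frequency bins play the role of input channels (c_in = f).
    """
    return t * k_t * c_in * c_out


# Hypothetical shapes, chosen only for illustration.
T, F, C = 98, 40, 64

# First layer: a 3x3 2D kernel on a 1-channel T x F map versus a
# length-3 temporal kernel over F input channels.
first_2d = conv2d_macs(T, F, 1, C, 3, 3)
first_1d = temporal_conv_macs(T, F, C, 3)

# A later layer with C channels in and out: the 2D feature map still
# spans T x F positions, while the temporal one spans only T.
later_2d = conv2d_macs(T, F, C, C, 3, 3)
later_1d = temporal_conv_macs(T, C, C, 3)

if __name__ == "__main__":
    print("first layer  2D / 1D MACs:", first_2d, first_1d)
    print("later layer  2D / 1D MACs:", later_2d, later_1d)
```

Under these assumed shapes, the gap widens after the first layer because the temporal feature map no longer carries a frequency axis. Counts like these are only a proxy: the abstract stresses measured latency on a device, which depends on more than the number of operations.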
Pages: 3372-3376
Page count: 5