Broadcasted Residual Learning for Efficient Keyword Spotting

Cited by: 51
Authors
Kim, Byeonggeun [1 ]
Chang, Simyung [1 ]
Lee, Jinkyu [1 ]
Sung, Dooyong [1 ]
Affiliations
[1] Qualcomm Korea YH, Qualcomm AI Res, Seoul, South Korea
Source
INTERSPEECH 2021 | 2021
Keywords
keyword spotting; speech command recognition; deep neural network; efficient neural network; residual learning;
DOI
10.21437/Interspeech.2021-383
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104; 100213;
Abstract
Keyword spotting is an important research field because it plays a key role in device wake-up and user interaction on smart devices. However, it is challenging to minimize errors while operating efficiently on devices with limited resources such as mobile phones. We present a broadcasted residual learning method to achieve high accuracy with small model size and computational load. Our method configures most of the residual functions as 1D temporal convolutions while still allowing 2D convolution, using a broadcasted residual connection that expands the temporal output to the frequency-temporal dimension. This residual mapping enables the network to effectively represent useful audio features with much less computation than conventional convolutional neural networks. We also propose a novel network architecture, the broadcasting-residual network (BC-ResNet), based on broadcasted residual learning, and describe how to scale up the model according to the target device's resources. BC-ResNets achieve state-of-the-art 98.0% and 98.7% top-1 accuracy on Google Speech Commands datasets v1 and v2, respectively, and consistently outperform previous approaches while using fewer computations and parameters.
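
The abstract describes a residual block that applies a 2D frequency-wise convolution, averages over the frequency axis, applies a 1D temporal residual function, and then broadcasts the temporal output back to the frequency-temporal dimensions. Below is a minimal PyTorch sketch of such a block, assuming input features shaped (batch, channels, frequency, time); the class name BroadcastedResBlock is illustrative, and the paper's SubSpectral Norm is simplified to plain BatchNorm2d here, so this is a sketch of the idea rather than the authors' implementation.

import torch
import torch.nn as nn

class BroadcastedResBlock(nn.Module):
    """Sketch: y = x + f2(x) + broadcast(f1(mean_freq(f2(x)))) for x of shape (B, C, F, T)."""
    def __init__(self, channels: int, dropout: float = 0.1):
        super().__init__()
        # f2: frequency-wise depthwise 2D convolution (kernel spans frequency only)
        self.f2 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0),
                      groups=channels, bias=False),
            nn.BatchNorm2d(channels),  # stand-in for the paper's SubSpectral Norm
        )
        # f1: temporal depthwise convolution + pointwise convolution on the
        # frequency-averaged tensor, i.e. an effectively 1D residual function
        self.f1 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1),
                      groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),  # swish activation
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.Dropout2d(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.f2(x)                   # 2D frequency-wise features, (B, C, F, T)
        t = a.mean(dim=2, keepdim=True)  # average over frequency -> (B, C, 1, T)
        b = self.f1(t)                   # temporal features, (B, C, 1, T)
        # broadcasted residual connection: b expands over the frequency axis
        # automatically via tensor broadcasting when added to x and a
        return x + a + b

x = torch.randn(8, 40, 20, 101)          # (batch, channels, freq, time)
y = BroadcastedResBlock(40)(x)
print(y.shape)                           # torch.Size([8, 40, 20, 101])

Because the temporal branch operates on a frequency-averaged tensor of shape (B, C, 1, T), its convolutions cost roughly 1/F of the equivalent full 2D convolutions, which is the source of the computational saving claimed in the abstract.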
Pages: 4538-4542
Number of pages: 5