A Lightweight Architecture for Query-by-Example Keyword Spotting on Low-Power IoT Devices

Cited by: 12
Authors
Li, Meirong [1]
Affiliations
[1] Xian Aeronaut Univ, Sch Comp Sci, Xian 710077, Peoples R China
Keywords
Feature extraction; Internet of Things; Computer architecture; Neural networks; Keyword search; Task analysis; Recurrent neural networks; Keyword spotting; convolutional recurrent neural network; model compression; segmental local normalized DTW algorithm; SMALL-FOOTPRINT; NEURAL-NETWORK;
DOI
10.1109/TCE.2022.3213075
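Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification code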
0808 ; 0809 ;
Abstract
Keyword spotting (KWS) is the task of recognizing a keyword or a particular command in a continuous audio stream; it can be applied effectively to voice trigger systems that automatically monitor and process speech signals. This paper focuses on the problem of user-defined keyword spotting in low-resource settings. A lightweight neural network architecture is developed to tackle keyword detection using query-by-example (QbyE) techniques. The architecture uses a convolutional recurrent neural network (CRNN) to extract frame-level features from input audio signals. A customized model compression method is proposed to compress the network, making it suitable for low-power settings. During keyword enrollment, all enrolled keyword examples are merged into a single keyword template, which is then used to detect the target keyword during keyword search. To improve the efficiency of keyword search, a segmental local normalized DTW algorithm is introduced. Experiments on datasets collected in real-world settings show that the approach consistently outperforms state-of-the-art methods, and that the proposed system runs on an ARM Cortex-A7 processor and achieves real-time keyword detection.
Pages: 65-75
Page count: 11