A Lightweight Architecture for Query-by-Example Keyword Spotting on Low-Power IoT Devices

Cited by: 12

Authors:

Li, Meirong [1]

Affiliations:

[1] Xian Aeronaut Univ, Sch Comp Sci, Xian 710077, Peoples R China

Source:

IEEE TRANSACTIONS ON CONSUMER ELECTRONICS | 2023 / Vol. 69 / Issue 01

Keywords:

Feature extraction; Internet of Things; Computer architecture; Neural networks; Keyword search; Task analysis; Recurrent neural networks; Keyword spotting; convolutional recurrent neural network; model compression; segmental local normalized DTW algorithm; SMALL-FOOTPRINT; NEURAL-NETWORK

DOI:

10.1109/TCE.2022.3213075

Chinese Library Classification (CLC) codes:

TM [Electrical engineering]; TN [Electronic technology, communication technology]

Discipline classification codes:

0808; 0809

Abstract:

Keyword spotting (KWS) is the task of recognizing a keyword or a particular command in a continuous audio stream; it can be applied to voice trigger systems that automatically monitor and process speech signals. This paper focuses on user-defined keyword spotting in low-resource settings. A lightweight neural network architecture is developed for keyword detection using query-by-example (QbyE) techniques. The architecture uses a convolutional recurrent neural network (CRNN) to extract frame-level features from input audio signals. A customized model compression method is proposed to compress the network, making it suitable for low-power settings. During keyword enrollment, all enrolled keyword examples are merged into a single keyword template, which is then used to detect the target keyword during keyword search. To improve the efficiency of keyword search, a segmental local normalized DTW algorithm is introduced. Experiments on real-world datasets show that the approach consistently outperforms state-of-the-art methods, and that the proposed system can run on an ARM Cortex-A7 processor and achieve real-time keyword detection.

Pages: 65-75

Page count: 11

