Deep Spoken Keyword Spotting: An Overview

Cited by: 63
Authors
Lopez-Espejo, Ivan [1]
Tan, Zheng-Hua [1]
Hansen, John H. L. [2]
Jensen, Jesper [1,3]
Affiliations
[1] Aalborg Univ, Dept Elect Syst, DK-9220 Aalborg, Denmark
[2] Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, Richardson, TX 75080 USA
[3] Oticon AS, DK-2765 Smorum, Denmark
Keywords
Hidden Markov models; Acoustics; Feature extraction; Decoding; Computational modeling; Viterbi algorithm; Virtual assistants; Keyword spotting; deep learning; acoustic model; small footprint; robustness; ADAPTIVE NOISE CANCELLATION; SMALL-FOOTPRINT; SPEECH RECOGNITION; SPEAKER VERIFICATION; TERM DETECTION; ROBUST; REPRESENTATIONS; ENHANCEMENT; CHALLENGE; ATTENTION;
DOI
10.1109/ACCESS.2021.3139508
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.
Pages: 4169-4199
Number of pages: 31