A multimodel keyword spotting system based on lip movement and speech features

Cited by: 4
Authors
Handa, Anand [1 ]
Agarwal, Rashi [2 ]
Kohli, Narendra [3 ]
Affiliations
[1] Dr APJ Abdul Kalam Tech Univ, Dept CSE, Lucknow, Uttar Pradesh, India
[2] CSJM Univ, UIET, Dept Informat Technol, Kanpur, Uttar Pradesh, India
[3] HBTU, Dept Comp Sci & Engn, Kanpur, Uttar Pradesh, India
Keywords
Keyword spotting and recognition; Convolutional neural networks; Lip movement and lip reading; Long short-term memory; Speech analysis
DOI
10.1007/s11042-020-08837-2
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Spoken keyword recognition and localization, collectively known as keyword spotting, are fundamental aspects of speech recognition. In automatic keyword spotting systems, lip-reading (LR) methods play a broader role when audio data is absent or corrupted. Existing works in the literature focus on recognizing a limited number of words or phrases and require cropped face or lip regions, whereas the proposed model does not require cropping of the video frames and is recognition-free. The proposed model utilizes Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to improve overall performance. The model creates a 128-dimensional subspace that represents the feature vectors of the speech signal and the corresponding lip movements (focused viseme sequences). The proposed model can therefore handle lip reading of unconstrained natural speech in video sequences. In the experiments, standard datasets such as LRW (Oxford-BBC), MIRACL-VC1, OuluVS, GRID, and CUAVE are used to evaluate the proposed model. The experiments also include a comparative analysis of the proposed model against current state-of-the-art methods on the lip-reading and keyword-spotting tasks. The proposed model obtains excellent results on all datasets under consideration.
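To make the 128-dimensional shared subspace concrete, below is a minimal PyTorch sketch of a two-stream CNN+LSTM embedding model in the spirit of the abstract. All layer sizes, module names (VisualEncoder, AudioEncoder), and the choice of MFCC audio features are illustrative assumptions rather than the authors' published architecture; only the CNN+LSTM combination and the 128-dimensional embedding are taken from the abstract.

```python
# Illustrative sketch only: sizes, names, and features are assumptions,
# not the paper's exact design.
import torch
import torch.nn as nn

EMBED_DIM = 128  # shared subspace dimensionality stated in the abstract

class VisualEncoder(nn.Module):
    """CNN over video frames, then an LSTM over time (viseme sequence)."""
    def __init__(self, embed_dim=EMBED_DIM):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-frame feature extractor
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                    # -> (B*T, 64, 1, 1)
        )
        self.lstm = nn.LSTM(64, 128, batch_first=True)  # temporal modeling
        self.proj = nn.Linear(128, embed_dim)           # map into shared subspace

    def forward(self, frames):                          # frames: (B, T, 1, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).flatten(1)   # (B*T, 64)
        _, (h, _) = self.lstm(f.view(b, t, -1))         # h: (1, B, 128)
        return self.proj(h[-1])                         # (B, embed_dim)

class AudioEncoder(nn.Module):
    """LSTM over frame-level speech features (assumed here to be MFCCs)."""
    def __init__(self, n_mfcc=13, embed_dim=EMBED_DIM):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, 128, batch_first=True)
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, mfcc):                            # mfcc: (B, T, n_mfcc)
        _, (h, _) = self.lstm(mfcc)
        return self.proj(h[-1])                         # (B, embed_dim)

# Usage: embed both modalities into the same 128-d space and compare
# by cosine similarity to score a candidate keyword match.
video = torch.randn(2, 25, 1, 64, 64)                   # 2 clips, 25 grayscale frames
audio = torch.randn(2, 100, 13)                         # 2 clips, 100 MFCC frames
v, a = VisualEncoder()(video), AudioEncoder()(audio)
score = torch.cosine_similarity(v, a)                   # one match score per clip
print(score.shape)                                      # torch.Size([2])
```

Projecting both modalities into a single subspace lets the audio stream drive spotting when speech is clean and the visual stream take over when it is missing or corrupted, which matches the abstract's motivation for combining the two.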
Pages: 20461-20481
Page count: 21