A multimodel keyword spotting system based on lip movement and speech features

Cited by: 4
Authors
Handa, Anand [1 ]
Agarwal, Rashi [2 ]
Kohli, Narendra [3 ]
Affiliations
[1] Dr APJ Abdul Kalam Tech Univ, Dept CSE, Lucknow, Uttar Pradesh, India
[2] CSJM Univ, UIET, Dept Informat Technol, Kanpur, Uttar Pradesh, India
[3] HBTU, Dept Comp Sci & Engn, Kanpur, Uttar Pradesh, India
Keywords
Keyword spotting and recognition; Convolutional neural networks; Lip movement and lip reading; Long short-term memory; Speech analysis
DOI
10.1007/s11042-020-08837-2
CLC number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Spoken keyword recognition and localization, collectively known as keyword spotting, are fundamental aspects of speech recognition. In automatic keyword spotting systems, lip-reading (LR) methods play a broader role when audio data is absent or corrupted. Existing works in the literature focus on recognizing a limited number of words or phrases and require cropped face or lip regions, whereas the proposed model does not require cropping of the video frames and is recognition-free. The proposed model utilizes Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to improve overall performance. The model creates a 128-dimensional subspace that represents the feature vectors of the speech signal and the corresponding lip movements (focused viseme sequences). The proposed model can therefore handle lip reading of unconstrained natural speech in video sequences. In the experiments, standard datasets such as LRW (Oxford-BBC), MIRACL-VC1, OuluVS, GRID, and CUAVE are used to evaluate the proposed model. The experiments also include a comparative analysis of the proposed model against current state-of-the-art methods on the lip-reading and keyword-spotting tasks. The proposed model obtains excellent results on all datasets under consideration.
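To make the 128-dimensional shared subspace concrete, below is a minimal PyTorch sketch of a two-stream CNN+LSTM embedding model in the spirit of the abstract. All layer sizes, module names (VisualEncoder, AudioEncoder), and the choice of MFCC audio features are illustrative assumptions rather than the authors' published architecture; only the CNN+LSTM combination and the 128-dimensional embedding are taken from the abstract.

```python
# Illustrative sketch only: sizes, names, and features are assumptions,
# not the paper's exact design.
import torch
import torch.nn as nn

EMBED_DIM = 128  # shared subspace dimensionality stated in the abstract

class VisualEncoder(nn.Module):
    """CNN over video frames, then an LSTM over time (viseme sequence)."""
    def __init__(self, embed_dim=EMBED_DIM):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-frame feature extractor
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                    # -> (B*T, 64, 1, 1)
        )
        self.lstm = nn.LSTM(64, 128, batch_first=True)  # temporal modeling
        self.proj = nn.Linear(128, embed_dim)           # map into shared subspace

    def forward(self, frames):                          # frames: (B, T, 1, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).flatten(1)   # (B*T, 64)
        _, (h, _) = self.lstm(f.view(b, t, -1))         # h: (1, B, 128)
        return self.proj(h[-1])                         # (B, embed_dim)

class AudioEncoder(nn.Module):
    """LSTM over frame-level speech features (assumed here to be MFCCs)."""
    def __init__(self, n_mfcc=13, embed_dim=EMBED_DIM):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, 128, batch_first=True)
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, mfcc):                            # mfcc: (B, T, n_mfcc)
        _, (h, _) = self.lstm(mfcc)
        return self.proj(h[-1])                         # (B, embed_dim)

# Usage: embed both modalities into the same 128-d space and compare
# by cosine similarity to score a candidate keyword match.
video = torch.randn(2, 25, 1, 64, 64)                   # 2 clips, 25 grayscale frames
audio = torch.randn(2, 100, 13)                         # 2 clips, 100 MFCC frames
v, a = VisualEncoder()(video), AudioEncoder()(audio)
score = torch.cosine_similarity(v, a)                   # one match score per clip
print(score.shape)                                      # torch.Size([2])
```

Projecting both modalities into a single subspace lets the audio stream drive spotting when speech is clean and the visual stream take over when it is missing or corrupted, which matches the abstract's motivation for combining the two.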
Pages: 20461-20481
Page count: 21