A novel framework using 3D-CNN and BiLSTM model with dynamic learning rate scheduler for visual speech recognition

Cited by: 2
Authors
Chandrabanshi, Vishnu [1]
Domnic, S. [1]
Affiliations
[1] Natl Inst Technol, Dept Comp Applicat, Tiruchirappalli 620015, Tamil Nadu, India
Keywords
3D-CNN; BiLSTM; Visual speech recognition; LRS; Deep learning; Network;
DOI
10.1007/s11760-024-03245-7
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject classification codes
0808; 0809;
Abstract
Visual Speech Recognition (VSR) is an appealing technology for predicting and analyzing spoken language based on lip movements. Previous research in this area has primarily concentrated on leveraging both audio and visual cues to achieve higher speech recognition accuracy. However, existing solutions face significant limitations, including inadequate training data, variations in speech patterns, and similar homophones, which call for more comprehensive feature representations to improve accuracy. This article presents a novel deep learning model for word-level VSR. In this study, we introduce a dynamic learning rate scheduler that adapts the learning rate during model training. Additionally, we employ an optimized Three-Dimensional Convolutional Neural Network (3D-CNN) to extract spatio-temporal features. To enhance context processing and ensure accurate mapping of input sequences to output sequences, we combine Bidirectional Long Short-Term Memory (BiLSTM) with the Connectionist Temporal Classification (CTC) loss function. We use the GRID dataset to assess word-level metrics, including Word Error Rate (WER) and Word Recognition Rate (WRR). The model achieves 1.11% WER and 98.89% WRR for overlapped speakers, demonstrating that our strategy outperforms existing VSR methods. Practical Implications - The proposed work aims to improve VSR accuracy, facilitating its seamless integration into real-time applications. The VSR model finds applications in liveness detection for person authentication, in strengthening password security by not relying on written or spoken passcodes, in underwater communication, and in aiding individuals with hearing and speech impairments in the medical field.
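The abstract highlights a dynamic learning rate scheduler that adapts the learning rate during training, but does not specify the schedule itself. As a rough illustration of what such a mechanism can look like, the following is a minimal, hypothetical plateau-based sketch in plain Python; the class name, decay factor, and patience values are illustrative assumptions, not the authors' actual method:

```python
class DynamicLRScheduler:
    """Illustrative plateau-based learning-rate scheduler.

    Halves the learning rate whenever the monitored validation loss
    has not improved for `patience` consecutive epochs, never going
    below `min_lr`. Call `step(val_loss)` once per epoch.
    """

    def __init__(self, initial_lr=1e-3, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = initial_lr
        self.factor = factor      # multiplicative decay applied on plateau
        self.patience = patience  # epochs without improvement before decay
        self.min_lr = min_lr
        self.best_loss = float("inf")
        self.wait = 0             # epochs since last improvement

    def step(self, val_loss):
        """Update and return the learning rate for the next epoch."""
        if val_loss < self.best_loss:
            # Validation loss improved: reset the patience counter.
            self.best_loss = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # Plateau detected: decay the learning rate.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr
```

In a training loop, the value returned by `step` would be written back into the optimizer after each validation pass. Deep learning frameworks ship comparable built-ins (e.g. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau`), which the authors' scheduler may or may not resemble.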
Pages: 5433-5448
Page count: 16