Visual Speech Recognition (VSR) is an appealing technology for predicting and analyzing spoken language based on lip movements. Previous research in this area has primarily concentrated on leveraging both audio and visual cues to achieve higher speech recognition accuracy. However, existing solutions face significant limitations, including inadequate training data, variations in speech patterns, and visually similar homophones, which call for more comprehensive feature representations to improve accuracy. This article presents a novel deep learning model for word-level VSR. In this study, we introduce a dynamic learning rate scheduler that adapts the learning rate during model training. Additionally, we employ an optimized Three-Dimensional Convolutional Neural Network (3D CNN) to extract spatio-temporal features. To enhance context processing and ensure accurate mapping of input sequences to output sequences, we combine Bidirectional Long Short-Term Memory (BiLSTM) with the Connectionist Temporal Classification (CTC) loss function. We evaluate the model on the GRID dataset using word-level metrics, namely Word Error Rate (WER) and Word Recognition Rate (WRR). The model achieves a 1.11% WER and a 98.89% WRR for overlapped speakers, demonstrating that our approach outperforms existing VSR methods.

Practical Implications - The proposed work aims to elevate the accuracy of VSR, facilitating its seamless integration into real-time applications. The VSR model finds applications in liveness detection for person authentication, improving password security by removing reliance on written or spoken passcodes, underwater communication, and aiding individuals with hearing and speech impairments in the medical field.
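To make the described pipeline concrete, the following is a minimal PyTorch sketch of the kind of architecture summarized above: a 3D CNN front-end feeding a BiLSTM, trained with CTC loss under a dynamic learning-rate scheduler. All layer sizes, the vocabulary size, the input resolution, and the choice of ReduceLROnPlateau as the scheduler are illustrative assumptions, not values taken from the article.

```python
# Illustrative sketch (not the authors' code): 3D-CNN front-end + BiLSTM + CTC.
import torch
import torch.nn as nn

class VSRNet(nn.Module):
    def __init__(self, vocab_size=28, hidden=256):   # assumed sizes
        super().__init__()
        # 3D convolutions capture spatio-temporal lip-motion features.
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.pool = nn.AdaptiveAvgPool3d((None, 4, 4))  # keep the time axis
        # BiLSTM models temporal context in both directions.
        self.rnn = nn.LSTM(64 * 4 * 4, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)     # vocab includes CTC blank

    def forward(self, x):                        # x: (batch, 3, T, H, W)
        f = self.pool(self.frontend(x))          # (batch, 64, T, 4, 4)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)  # (batch, T, 64*4*4)
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)      # per-frame class log-probs

# CTC loss aligns frame-wise predictions with target character sequences.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
model = VSRNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# A dynamic learning-rate scheduler adapts the rate as validation loss plateaus.
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=3)

# Example input: 75 frames (a 3 s GRID utterance at 25 fps), assumed 50x100 crops.
video = torch.randn(2, 3, 75, 50, 100)
log_probs = model(video).permute(1, 0, 2)        # (T, batch, vocab) for CTCLoss
```

In this sketch the CTC blank symbol occupies index 0 and the frame-level outputs are decoded (e.g., by greedy or beam search) into word sequences before computing WER and WRR; the exact decoding and vocabulary scheme used in the article may differ.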