LipSyncNet: A Novel Deep Learning Approach for Visual Speech Recognition in Audio-Challenged Situations

Cited by: 2
Authors
Jeevakumari, S. A. Amutha [1 ]
Dey, Koushik [1 ]
Affiliations
[1] Vellore Inst Technol, Sch Comp Sci & Engn, Chennai 600127, India
Keywords
Visualization; Accuracy; Speech recognition; Feature extraction; Deep learning; Speech enhancement; Long short-term memory; Convolutional neural networks; bidirectional long short-term memory; visual cues; lip reading; 3D convolutional neural network; connectionist temporal classification
DOI
10.1109/ACCESS.2024.3436931
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
In recent lip-reading systems, deep learning has become the dominant methodology, moving beyond traditional hybrid Deep Neural Network-Hidden Markov Model (DNN-HMM) frameworks built on Discrete Cosine Transform (DCT) features. LipSyncNet uses a three-dimensional Convolutional Neural Network (3D-CNN) front end, at most four layers deep, that extracts visual features and integrates EfficientNetB0 for stronger feature extraction. The back end combines a Bidirectional Long Short-Term Memory (Bi-LSTM) network, a member of the recurrent neural network family, with Connectionist Temporal Classification (CTC) loss for sequence classification. The effectiveness of the proposed method is demonstrated on the GRID corpus, a challenging word-level lip-reading dataset. Visual features are first extracted from the mouth region of the speaker's face and then combined with any available audio information to identify spoken words precisely. The aim of the lip-reading method is a system that recognizes speech accurately from visual cues, reducing reliance on audio. By fusing information from multiple levels in a unified architecture, the model can distinguish words that sound alike and is more robust to changes in a speaker's physical appearance.
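A minimal PyTorch sketch of the pipeline the abstract describes: a shallow 3D-CNN front end bridged into EfficientNetB0 for per-frame spatial features, followed by a Bi-LSTM and a CTC-ready output head. The class name, layer sizes, kernel shapes, and the channel-reduction bridge between the 3D and 2D stages are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torchvision.models as models

class LipSyncNetSketch(nn.Module):
    """Hypothetical reconstruction: 3D-CNN -> EfficientNetB0 -> Bi-LSTM -> CTC head."""

    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        # Front end: a shallow 3D-CNN over (batch, channels, time, H, W) lip clips.
        # The abstract says up to four layers; two convs are shown here for brevity.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),                 # halve H and W, keep T
            nn.Conv3d(32, 3, kernel_size=3, padding=1),          # assumed bridge back to 3 channels
            nn.ReLU(),
        )
        # Per-frame spatial features from EfficientNetB0 (classifier removed).
        effnet = models.efficientnet_b0(weights=None)
        self.effnet = nn.Sequential(effnet.features, effnet.avgpool)  # -> (N, 1280, 1, 1)
        # Temporal back end: Bi-LSTM over the frame-feature sequence.
        self.bilstm = nn.LSTM(1280, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, vocab_size + 1)        # +1 for the CTC blank symbol

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, 3, T, H, W)
        x = self.cnn3d(clips)
        b, c, t, h, w = x.shape
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)    # fold time into batch
        feats = self.effnet(frames).flatten(1).reshape(b, t, -1)     # (B, T, 1280)
        out, _ = self.bilstm(feats)                                  # (B, T, 2*hidden)
        return self.head(out).log_softmax(dim=-1)                    # per-frame log-probs

For training, the output would be transposed to (T, B, vocab+1) and paired with torch.nn.CTCLoss, which lets the network align frame-level predictions to word sequences without frame-level labels.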
Pages: 110891-110904
Page count: 14