Automatic Lip Reading Using Convolution Neural Network and Bidirectional Long Short-term Memory

Times Cited: 8
Authors
Lu, Yuanyao [1 ]
Yan, Jie [1 ]
Affiliations
[1] North China Univ Technol, Sch Elect & Informat Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
Automatic lip reading; deep learning; convolution neural network; bidirectional long short-term memory; MODELS;
DOI
10.1142/S0218001420540038
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Traditional automatic lip-reading systems generally consist of two stages, feature extraction and recognition, but the handcrafted features are empirical and cannot sufficiently capture the relevance within lip movement sequences. Recently, deep learning approaches have attracted increasing attention, especially with the significant improvements of convolutional neural networks (CNNs) applied to image classification and long short-term memory (LSTM) networks used in speech recognition, video processing and text analysis. In this paper, we propose a hybrid neural network architecture that integrates a CNN and a bidirectional LSTM (BiLSTM) for lip reading. First, we extract key frames from each isolated video clip and use five key points to locate the mouth region. Then, features are extracted from the raw mouth images by an eight-layer CNN; the extracted features are more robust and fault-tolerant. Finally, we use a BiLSTM to capture the correlation of sequential information among frame features in both directions, and a softmax function to predict the final recognition result. The proposed method extracts local features through convolution operations and discovers hidden temporal correlations in lip image sequences. Evaluation results of lip-reading recognition experiments demonstrate that the proposed method outperforms conventional approaches such as the active contour model (ACM) and the hidden Markov model (HMM).
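To make the described pipeline concrete, below is a minimal PyTorch sketch of the CNN + BiLSTM architecture outlined in the abstract. The layer widths, kernel sizes, number of key frames, input resolution, and number of word classes are illustrative assumptions not given in this record, and the eight-layer CNN is only approximated by a generic conv/pool stack.

```python
# Hypothetical sketch of the CNN + BiLSTM lip-reading pipeline from the abstract.
# All hyperparameters below are assumptions; the record does not specify them.
import torch
import torch.nn as nn


class LipReadingNet(nn.Module):
    def __init__(self, num_classes=10, cnn_feat_dim=256, lstm_hidden=128):
        super().__init__()
        # CNN that maps each mouth-region key frame to a fixed-length feature
        # vector (the paper's exact eight-layer configuration is not given here).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, cnn_feat_dim)
        # Bidirectional LSTM over the per-frame CNN features.
        self.bilstm = nn.LSTM(cnn_feat_dim, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Linear classifier; softmax is applied in the loss or at inference.
        self.classifier = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, frames):
        # frames: (batch, time, 1, H, W) grayscale mouth-region key frames
        b, t = frames.shape[:2]
        x = frames.reshape(b * t, *frames.shape[2:])
        feats = self.cnn(x).flatten(1)             # (b*t, 128)
        feats = self.proj(feats).reshape(b, t, -1) # (b, t, cnn_feat_dim)
        seq, _ = self.bilstm(feats)                # (b, t, 2*lstm_hidden)
        logits = self.classifier(seq[:, -1])       # last time step -> word logits
        return logits


if __name__ == "__main__":
    model = LipReadingNet()
    dummy = torch.randn(2, 16, 1, 64, 64)  # 2 clips, 16 key frames of 64x64 each
    print(model(dummy).shape)               # torch.Size([2, 10])
```

The sketch illustrates the division of labor described in the abstract: the CNN supplies per-frame spatial features, while the BiLSTM aggregates them in both temporal directions before classification.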
Pages: 14