Lipreading using Convolutional Neural Network

Cited by: 0
Authors
Noda, Kuniaki [1]
Yamaguchi, Yuki [2]
Nakadai, Kazuhiro [3]
Okuno, Hiroshi G. [2]
Ogata, Tetsuya [1]
Affiliations
[1] Waseda Univ, Grad Sch Fundamental Sci & Engn, Tokyo, Japan
[2] Kyoto Univ, Grad Sch Informat, Kyoto, Japan
[3] Honda Res Inst Japan Co Ltd, Saitama, Japan
Source
15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4 | 2014
Keywords
Lipreading; Visual Feature Extraction; Convolutional Neural Network; Recognition
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent automatic speech recognition studies, deep learning architecture applications for acoustic modeling have eclipsed conventional sound features such as Mel-frequency cepstral coefficients. However, for visual speech recognition (VSR) studies, handcrafted visual feature extraction mechanisms are still widely utilized. In this paper, we propose to apply a convolutional neural network (CNN) as a visual feature extraction mechanism for VSR. By training a CNN with images of a speaker's mouth area in combination with phoneme labels, the CNN acquires multiple convolutional filters, used to extract visual features essential for recognizing phonemes. Further, by modeling the temporal dependencies of the generated phoneme label sequences, a hidden Markov model in our proposed system recognizes multiple isolated words. Our proposed system is evaluated on an audio-visual speech dataset comprising 300 Japanese words with six different speakers. The evaluation results of our isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those acquired by conventional dimensionality compression approaches, including principal component analysis.
Pages: 1149-1153
Number of pages: 5
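
The recognition stage described in the abstract (frame-wise phoneme evidence from the CNN, followed by a hidden Markov model that picks an isolated word) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, purely for illustration, that each video frame has already been reduced to a phoneme posterior distribution, that each word is a left-to-right HMM with one state per phoneme, and that a single fixed self-transition probability is shared by all states. The names `word_log_score`, `recognize`, `stay_prob`, and the toy lexicon are hypothetical.

```python
import math


def word_log_score(posteriors, phonemes, stay_prob=0.6):
    """Viterbi log-score of a left-to-right phoneme HMM for one word.

    posteriors: list of dicts, one per video frame, mapping phoneme -> probability
                (an assumed stand-in for per-frame CNN outputs).
    phonemes:   the word's phoneme sequence, one HMM state per phoneme.
    stay_prob:  assumed self-transition probability, shared by all states.
    """
    log_stay = math.log(stay_prob)
    log_move = math.log(1.0 - stay_prob)
    floor = 1e-6  # avoids log(0) when a frame gives a phoneme no mass
    neg_inf = float("-inf")

    def emit(frame, ph):
        return math.log(max(frame.get(ph, 0.0), floor))

    # The path must start in the first state.
    scores = [neg_inf] * len(phonemes)
    scores[0] = emit(posteriors[0], phonemes[0])

    for frame in posteriors[1:]:
        new = [neg_inf] * len(phonemes)
        for s, ph in enumerate(phonemes):
            stay = scores[s] + log_stay
            move = scores[s - 1] + log_move if s > 0 else neg_inf
            best = max(stay, move)
            if best > neg_inf:
                new[s] = best + emit(frame, ph)
        scores = new

    # The path must end in the last state.
    return scores[-1]


def recognize(posteriors, lexicon):
    """Return the lexicon word whose HMM scores the frame sequence highest."""
    return max(lexicon, key=lambda w: word_log_score(posteriors, lexicon[w]))


# Toy data: five frames whose dominant phonemes are a, a, k, i, i.
frames = [
    {"a": 0.8, "k": 0.1, "i": 0.1},
    {"a": 0.8, "k": 0.1, "i": 0.1},
    {"a": 0.1, "k": 0.8, "i": 0.1},
    {"a": 0.1, "k": 0.1, "i": 0.8},
    {"a": 0.1, "k": 0.1, "i": 0.8},
]
lexicon = {"aka": ["a", "k", "a"], "aki": ["a", "k", "i"]}
print(recognize(frames, lexicon))  # prints "aki"
```

Scoring in the log domain keeps the products of many per-frame probabilities from underflowing. A full system would train the transition and emission parameters from data instead of fixing them by hand.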