LSTM model for visual speech recognition through facial expressions

Cited by: 21
Authors
Bhaskar, Shabina [1 ]
Thasleema, T. M. [1 ]
Affiliations
[1] Cent Univ Kerala, Kasaragod, Kerala, India
Keywords
Audio-visual emotion recognition; Audio-visual speech recognition; Hearing impaired; Convolutional neural network; Long short-term memory
DOI
10.1007/s11042-022-12796-1
Chinese Library Classification
TP [Automation technology; computer technology]
Subject classification code
0812
Abstract
Hearing-impaired persons are more expressive while speaking, and facial expression is therefore a salient feature in hearing-impaired Visual Speech Recognition. Most Visual Speech Recognition systems focus only on the lip region to recognize the speech or the speaker. This work uses video data that carries information from both speech and facial expressions. As part of this study, we developed a Malayalam audio-visual speech expression database of unimpaired people, and the experiments were conducted on this newly developed database. The data were collected from two people, one male and one female. A combined Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM) deep learning video-processing model is applied in this system. The results demonstrate that classification accuracy is higher for features extracted with the GoogleNet model than with the AlexNet and ResNet models. The system is evaluated in both speaker-dependent and speaker-independent settings. The recognition rates obtained in both experiments show that facial expression analysis plays a crucial role in Visual Speech Recognition.
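The CNN-LSTM pipeline described in the abstract can be sketched as follows: per-frame CNN features (e.g. GoogleNet activations, assumed precomputed) are fed to an LSTM, and the final hidden state is classified with a softmax. All dimensions, the single-layer LSTM, and the random parameters are illustrative assumptions, not details from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gate order in the stacked weights is [i, f, g, o]."""
    H = h.shape[0]
    z = x @ W + h @ U + b                      # all four gates at once, shape (4H,)
    i = sigmoid(z[:H])                         # input gate
    f = sigmoid(z[H:2 * H])                    # forget gate
    g = np.tanh(z[2 * H:3 * H])                # candidate cell state
    o = sigmoid(z[3 * H:])                     # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def classify_sequence(frame_feats, W, U, b, W_out, b_out):
    """Run an LSTM over per-frame CNN features, softmax the final state."""
    H = U.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    for x in frame_feats:                      # one feature vector per video frame
        h, c = lstm_step(x, h, c, W, U, b)
    logits = h @ W_out + b_out
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

# Illustrative dimensions (not from the paper): 64-d features, 32 hidden units,
# 5 word/expression classes, 10 frames per clip.
rng = np.random.default_rng(0)
D, H, K, T = 64, 32, 5, 10
W = rng.normal(0, 0.1, (D, 4 * H))
U = rng.normal(0, 0.1, (H, 4 * H))
b = np.zeros(4 * H)
W_out, b_out = rng.normal(0, 0.1, (H, K)), np.zeros(K)
probs = classify_sequence(rng.normal(size=(T, D)), W, U, b, W_out, b_out)
```

In practice the CNN and LSTM would be trained jointly in a deep learning framework; this sketch only shows how the temporal model consumes the spatial features.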
Pages: 5455-5472
Page count: 18