Deep Learning-Based Approach for Arabic Visual Speech Recognition

Cited: 4
Authors:
Alsulami, Nadia H. [1]
Jamal, Amani T. [1]
Elrefaei, Lamiaa A. [2]
Affiliations:
[1] King Abdulaziz Univ, Fac Comp & Informat Technol, Comp Sci Dept, Jeddah 21589, Saudi Arabia
[2] Benha Univ, Fac Engn Shoubra, Elect Engn Dept, Cairo 11629, Egypt
Source:
CMC-COMPUTERS MATERIALS & CONTINUA | 2022, Vol. 71, No. 1
Keywords:
Convolutional neural network; deep learning; lip reading; transfer learning; visual speech recognition
DOI:
10.32604/cmc.2022.019450
CLC Classification:
TP [Automation and computer technology]
Discipline Code:
0812
Abstract:
Lip-reading technologies have progressed rapidly following the breakthrough of deep learning, and they play a vital role in many applications, such as human-machine communication and security. In this paper, we propose an effective lip-reading model for Arabic visual speech recognition built on deep learning algorithms. The collected Arabic visual dataset contains 2400 recordings of Arabic digits and 960 recordings of Arabic phrases from 24 native speakers. The primary purpose is to provide a high-performance model by enhancing the preprocessing phase. Firstly, we extract keyframes from our dataset. Secondly, we produce Concatenated Frame Images (CFIs), each representing an utterance sequence as a single image. Finally, VGG-19 is employed for visual feature extraction in our proposed model. We examined different numbers of keyframes (10, 15, and 20) to compare two variants of the proposed model: (1) the VGG-19 base model and (2) the VGG-19 base model with batch normalization. The results show that the second variant achieves higher accuracy on the test dataset: 94% for digit recognition, 97% for phrase recognition, and 93% for combined digit-and-phrase recognition. These results demonstrate the effectiveness of our proposed model with CFI input.
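As a rough illustration of the preprocessing described in the abstract (keyframe extraction followed by Concatenated Frame Image construction), the sketch below uniformly samples keyframes from a frame sequence and tiles them into one image. The uniform sampling criterion, the grid layout, and the column count are assumptions; the abstract does not specify how keyframes are chosen or how the frames are arranged within a CFI.

```python
import numpy as np


def select_keyframes(frames, n_keyframes):
    """Pick n_keyframes from a video by uniform sampling (an assumption:
    the paper's actual keyframe-selection criterion is not given here)."""
    idx = np.linspace(0, len(frames) - 1, n_keyframes).round().astype(int)
    return [frames[i] for i in idx]


def make_cfi(keyframes, grid_cols=5):
    """Tile keyframes row by row into a single Concatenated Frame Image.

    All keyframes are assumed to share the same (height, width, channels).
    """
    h, w, c = keyframes[0].shape
    rows = int(np.ceil(len(keyframes) / grid_cols))
    cfi = np.zeros((rows * h, grid_cols * w, c), dtype=keyframes[0].dtype)
    for i, frame in enumerate(keyframes):
        r, col = divmod(i, grid_cols)
        cfi[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return cfi
```

The resulting single image can then be fed to a standard image classifier such as VGG-19, which is the role CFIs play in the model described above.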
Pages: 85-108 (24 pages)