Developing phoneme-based lip-reading sentences system for silent speech recognition

Cited by: 12
Authors
El-Bialy, Randa [1 ,2 ]
Chen, Daqing [1 ]
Fenghour, Souheil [1 ]
Hussein, Walid [2 ]
Xiao, Perry [1 ]
Karam, Omar H. [2 ]
Li, Bo [3 ]
Affiliations
[1] London South Bank Univ, Sch Engn, London, England
[2] British Univ Egypt, Fac Informat & Comp Sci, Cairo, Egypt
[3] Northwestern Polytech Univ, Sch Elect & Informat, Xian, Peoples R China
Keywords
deep learning; deep neural networks; lip-reading; phoneme-based lip-reading; spatial-temporal convolution; transformers;
DOI
10.1049/cit2.12131
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Lip-reading is the process of interpreting speech by visually analysing lip movements. Recent research in this area has shifted from recognising isolated words to lip-reading sentences in the wild. This paper uses phonemes as the classification schema for sentence-level lip-reading, both to explore an alternative to existing schemas and to improve system performance. Different classification schemas have been investigated, including character-based and viseme-based schemas. The visual front-end of the system consists of a spatial-temporal (3D) convolution followed by a 2D ResNet. A Transformer with multi-headed attention is used for the phoneme recognition model, and a Recurrent Neural Network is used as the language model. The performance of the proposed system has been evaluated on the BBC Lip Reading Sentences 2 (LRS2) benchmark dataset. Compared with state-of-the-art approaches to sentence-level lip-reading, the proposed system achieves a word error rate that is on average 10% lower under varying illumination ratios.
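The abstract describes the architecture only at a high level. The following PyTorch sketch illustrates how a spatial-temporal (3D) convolution front-end feeding a per-frame 2D ResNet and a multi-headed-attention encoder could be wired together; all layer sizes, the resnet18 trunk, and the 44-phoneme output vocabulary are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of a phoneme-oriented visual front-end, assuming a resnet18
# trunk and a 44-phoneme vocabulary (both are illustrative choices, not the
# authors' reported configuration).
import torch
import torch.nn as nn
from torchvision.models import resnet18


class VisualFrontEnd(nn.Module):
    def __init__(self, num_phonemes: int = 44, d_model: int = 512):
        super().__init__()
        # Spatial-temporal (3D) convolution over (time, height, width) of mouth crops.
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # 2D ResNet trunk applied frame by frame; its stem is replaced so it
        # accepts the 64-channel output of the 3D convolution.
        trunk = resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        trunk.fc = nn.Linear(trunk.fc.in_features, d_model)
        self.resnet2d = trunk
        # Transformer encoder with multi-headed self-attention over frame features,
        # followed by a linear layer producing per-frame phoneme logits.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.classifier = nn.Linear(d_model, num_phonemes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width) grayscale mouth-region video.
        feats = self.conv3d(x)                        # (B, 64, T, H', W')
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feats = self.resnet2d(feats).view(b, t, -1)   # (B, T, d_model) per-frame features
        feats = self.encoder(feats)                   # multi-headed self-attention
        return self.classifier(feats)                 # (B, T, num_phonemes) phoneme logits


if __name__ == "__main__":
    model = VisualFrontEnd()
    clip = torch.randn(2, 1, 16, 112, 112)  # two clips of 16 grayscale 112x112 frames
    print(model(clip).shape)                 # torch.Size([2, 16, 44])
```

In the paper's pipeline these per-frame phoneme predictions would then be decoded into words by the Recurrent Neural Network language model; that decoding stage is omitted from the sketch.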
Pages: 129-138
Number of pages: 10