Developing phoneme-based lip-reading sentences system for silent speech recognition

Cited by: 12
Authors
El-Bialy, Randa [1 ,2 ]
Chen, Daqing [1 ]
Fenghour, Souheil [1 ]
Hussein, Walid [2 ]
Xiao, Perry [1 ]
Karam, Omar H. [2 ]
Li, Bo [3 ]
Affiliations
[1] London South Bank Univ, Sch Engn, London, England
[2] British Univ Egypt, Fac Informat & Comp Sci, Cairo, Egypt
[3] Northwestern Polytech Univ, Sch Elect & Informat, Xian, Peoples R China
Keywords
deep learning; deep neural networks; lip-reading; phoneme-based lip-reading; spatial-temporal convolution; transformers;
DOI
10.1049/cit2.12131
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Lip-reading is the process of interpreting speech by visually analysing lip movements. Recent research in this area has shifted from recognising isolated words to lip-reading sentences in the wild. This paper uses phonemes as the classification schema for sentence-level lip-reading, both to explore an alternative to existing schemas and to improve system performance. Different classification schemas have been investigated, including character-based and viseme-based schemas. The visual front-end of the system consists of a spatial-temporal (3D) convolution followed by a 2D ResNet. A Transformer with multi-headed attention is used for the phoneme recognition model, and a Recurrent Neural Network is used as the language model. The performance of the proposed system has been evaluated on the BBC Lip Reading Sentences 2 (LRS2) benchmark dataset. Compared with state-of-the-art approaches to sentence-level lip-reading, the proposed system achieves a word error rate that is on average 10% lower under varying illumination ratios.
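The abstract describes the architecture only at a high level. The following PyTorch sketch illustrates how a spatial-temporal (3D) convolution front-end feeding a per-frame 2D ResNet and a multi-headed-attention encoder could be wired together; all layer sizes, the resnet18 trunk, and the 44-phoneme output vocabulary are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of a phoneme-oriented visual front-end, assuming a resnet18
# trunk and a 44-phoneme vocabulary (both are illustrative choices, not the
# authors' reported configuration).
import torch
import torch.nn as nn
from torchvision.models import resnet18


class VisualFrontEnd(nn.Module):
    def __init__(self, num_phonemes: int = 44, d_model: int = 512):
        super().__init__()
        # Spatial-temporal (3D) convolution over (time, height, width) of mouth crops.
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # 2D ResNet trunk applied frame by frame; its stem is replaced so it
        # accepts the 64-channel output of the 3D convolution.
        trunk = resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        trunk.fc = nn.Linear(trunk.fc.in_features, d_model)
        self.resnet2d = trunk
        # Transformer encoder with multi-headed self-attention over frame features,
        # followed by a linear layer producing per-frame phoneme logits.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.classifier = nn.Linear(d_model, num_phonemes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width) grayscale mouth-region video.
        feats = self.conv3d(x)                        # (B, 64, T, H', W')
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feats = self.resnet2d(feats).view(b, t, -1)   # (B, T, d_model) per-frame features
        feats = self.encoder(feats)                   # multi-headed self-attention
        return self.classifier(feats)                 # (B, T, num_phonemes) phoneme logits


if __name__ == "__main__":
    model = VisualFrontEnd()
    clip = torch.randn(2, 1, 16, 112, 112)  # two clips of 16 grayscale 112x112 frames
    print(model(clip).shape)                 # torch.Size([2, 16, 44])
```

In the paper's pipeline these per-frame phoneme predictions would then be decoded into words by the Recurrent Neural Network language model; that decoding stage is omitted from the sketch.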
Pages: 129-138
Number of pages: 10