Speaker independent VSR: A systematic review and futuristic applications

Cited: 3
Authors
Nemani, Praneeth [1 ]
Krishna, Ghanta Sai [2 ]
Supriya, Kundrapu [1 ]
Kumar, Santosh [1 ]
Affiliations
[1] IIIT Naya Raipur, Dept Comp Sci & Engn, Raipur 493661, Chhattisgarh, India
[2] IIIT Naya Raipur, Dept Data Sci & AI, Chhattisgarh 493661, India
Keywords
VSR; Speaker-independence; Lip-reading; Feature extraction; Spatio-temporal; Visual speech recognition; Face; Extraction; Machine; Normalization; Segmentation; Classifier; Phonemes; Features; Models
DOI
10.1016/j.imavis.2023.104787
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Speaker-independent visual speech recognition (VSR) is the complex task of identifying spoken words or phrases from video recordings of a speaker's facial movements. A central challenge in this field is decoding the intricate visual dynamics of a speaker's mouth in a high-dimensional space; to address it, researchers have employed advanced techniques that enable machines to recognize human speech automatically from visual cues. Over the years, a considerable body of VSR research has evaluated different algorithms on a variety of datasets, resulting in significant progress toward effective VSR models and creating new opportunities for further work in this area. This survey provides an in-depth examination of the evolution of VSR over the past three decades, covering works published from 1990 to 2023, with particular emphasis on the transition from speaker-dependent to speaker-independent systems. It gives a comprehensive overview of the datasets used in VSR research and the preprocessing techniques employed to achieve speaker independence, thoroughly analyzing each surveyed work and comparing them on various parameters. The survey traces the development of VSR systems over time and highlights the need for end-to-end pipelines for speaker-independent VSR. Pictorial representations offer a clear and concise overview of the techniques used in speaker-independent VSR, aiding comprehension and analysis of the various methodologies. The survey also highlights the strengths and limitations of each technique and provides insights into developing novel approaches for analyzing visual speech cues. Overall, this comprehensive review captures the current state of the art in speaker-independent VSR and identifies potential areas for future research.
Pages: 24
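The abstract above points to preprocessing and feature extraction as core stages of speaker-independent VSR pipelines. As a hedged illustration only, and not a method taken from the surveyed papers, the Python sketch below shows one classical appearance-based step of this kind: normalizing a pre-cropped mouth region of interest (ROI) and extracting low-frequency 2-D discrete cosine transform (DCT) coefficients as a per-frame feature vector. The dct_mouth_features helper, the 64x64 ROI size, and the 8x8 coefficient block are illustrative assumptions.

# Minimal illustrative sketch of classical appearance-based VSR features:
# per-frame normalization of a mouth ROI followed by a 2-D DCT, keeping
# only the low-frequency coefficients. All sizes here are assumptions.
import numpy as np
from scipy.fftpack import dct


def dct_mouth_features(mouth_roi: np.ndarray, n_coeffs: int = 8) -> np.ndarray:
    """Return the top-left n_coeffs x n_coeffs block of 2-D DCT coefficients."""
    roi = mouth_roi.astype(np.float64)
    # Zero-mean, unit-variance normalization reduces illumination and
    # speaker-specific appearance variation before the transform.
    roi = (roi - roi.mean()) / (roi.std() + 1e-8)
    # Separable 2-D DCT: transform rows, then columns.
    coeffs = dct(dct(roi, norm="ortho", axis=0), norm="ortho", axis=1)
    return coeffs[:n_coeffs, :n_coeffs].ravel()


# Example: a synthetic 64x64 grayscale mouth crop stands in for a real video frame.
frame = np.random.rand(64, 64)
features = dct_mouth_features(frame)  # 64-dimensional per-frame feature vector
print(features.shape)                 # (64,)

Per-frame vectors like this would typically be stacked over time and fed to a temporal classifier; the sketch stops at the feature step, which is the part the record's keywords (feature extraction, normalization) describe.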