BenAV: a Bengali Audio-Visual Corpus for Visual Speech Recognition

Cited by: 0
Authors
Pondit, Ashish [1]
Rukon, Muhammad Eshaque Ali [1]
Das, Anik [2, 3]
Kabir, Muhammad Ashad [4]
Affiliations
[1] Chittagong Univ Engn & Technol CUET, Dept Comp Sci & Engn, Chattogram 4349, Bangladesh
[2] St Francis Xavier Univ, Dept Comp Sci, Antigonish, NS B2G 2W5, Canada
[3] Bangladesh Univ, Dept Comp Sci & Engn, Dhaka 1207, Bangladesh
[4] Charles Sturt Univ, Sch Comp Math & Engn, Bathurst, NSW 2795, Australia
Source
NEURAL INFORMATION PROCESSING, ICONIP 2021, PT II | 2021 / Vol. 13109
Keywords
Visual speech recognition; Audio-visual dataset; Lip reading; Corpus; Bengali; Deep learning;
DOI
10.1007/978-3-030-92270-2_45
Chinese Library Classification number
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual speech recognition (VSR) is a challenging task with many applications, such as supporting speech recognition when the acoustic data is noisy or missing and assisting hearing-impaired people. Modern VSR systems require a large amount of data to achieve good performance. Popular VSR datasets are mostly available for the English language, and none is available for Bengali. In this paper, we introduce a large-scale Bengali audio-visual dataset named "BenAV". To the best of our knowledge, BenAV is the first publicly available large-scale dataset in the Bengali language. BenAV contains a lexicon of 50 words from 128 speakers, with a total of 26,300 utterances. We also apply three existing deep-learning-based VSR models to provide baseline performance on the BenAV dataset. We run extensive experiments in two different configurations of the dataset to study the robustness of those models and achieve 98.70% and 82.5% accuracy, respectively. We believe that this research provides a basis for developing Bengali lip-reading systems and opens the door to further research on this topic.
Pages: 526-535
Page count: 10