BenAV: a Bengali Audio-Visual Corpus for Visual Speech Recognition

Cited by: 0

Authors:

Pondit, Ashish ^{[1]}

Rukon, Muhammad Eshaque Ali ^{[1]}

Das, Anik ^{[2,3]}

Kabir, Muhammad Ashad ^{[4]}

Affiliations:

[1] Chittagong Univ Engn & Technol CUET, Dept Comp Sci & Engn, Chattogram 4349, Bangladesh

[2] St Francis Xavier Univ, Dept Comp Sci, Antigonish, NS B2G 2W5, Canada

[3] Bangladesh Univ, Dept Comp Sci & Engn, Dhaka 1207, Bangladesh

[4] Charles Sturt Univ, Sch Comp Math & Engn, Bathurst, NSW 2795, Australia

Source:

NEURAL INFORMATION PROCESSING, ICONIP 2021, PT II | 2021 / Vol. 13109

Keywords:

Visual speech recognition; Audio-visual dataset; Lip reading; Corpus; Bengali; Deep learning;

DOI:

10.1007/978-3-030-92270-2_45

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual speech recognition (VSR) is a challenging task with many applications, such as supporting speech recognition when the acoustic data is noisy or missing and assisting hearing-impaired people. Modern VSR systems require a large amount of data to achieve good performance. Popular VSR datasets are mostly available for the English language, and none are available for Bengali. In this paper, we introduce a large-scale Bengali audio-visual dataset named "BenAV". To the best of our knowledge, BenAV is the first publicly available large-scale dataset in the Bengali language. BenAV contains a lexicon of 50 words from 128 speakers, with a total of 26,300 utterances. We also apply three existing deep-learning-based VSR models to provide baseline performance on the BenAV dataset. We ran extensive experiments on two different configurations of the dataset to study the robustness of those models and achieved 98.70% and 82.5% accuracy, respectively. We believe that this research provides a basis for developing Bengali lip reading systems and opens the door to further research on this topic.


Pages: 526-535

Number of pages: 10

