Multimodal Corpus Design for Audio-Visual Speech Recognition in Vehicle Cabin

被引：17

作者：

Kashevnik, Alexey ^{[1
]}

Lashkov, Igor ^{[1
]}

Axyonov, Alexandr ^{[1
]}

Ivanko, Denis ^{[1
]}

Ryumin, Dmitry ^{[1
]}

Kolchin, Artem ^{[2
]}

Karpov, Alexey ^{[1
]}

机构：

[1] Russian Acad Sci SPC RAS, St Petersburg Fed Res Ctr, St Petersburg 199178, Russia

[2] ITMO Univ, Informat Technol & Programming Fac, St Petersburg 197101, Russia

来源：

IEEE ACCESS | 2021年 / 9卷

基金：

俄罗斯基础研究基金会;

关键词：

Vehicles; Speech recognition; Smart phones; Monitoring; Sensors; Vocabulary; Task analysis; Driver monitoring; automatic speech recognition; multimodal corpus; human– computer interaction; ALERTNESS;

D O I：

10.1109/ACCESS.2021.3062752

中图分类号：

TP [自动化技术、计算机技术];

学科分类号：

0812 ;

摘要：

This paper introduces a new methodology aimed at comfort for the driver in-the-wild multimodal corpus creation for audio-visual speech recognition in driver monitoring systems. The presented methodology is universal and can be used for corpus recording for different languages. We present an analysis of speech recognition systems and voice interfaces for driver monitoring systems based on the analysis of both audio and video data. Multimodal speech recognition allows using audio data when video data are useless (e.g. at nighttime), as well as applying video data in acoustically noisy conditions (e.g., at highways). Our methodology identifies the main steps and requirements for multimodal corpus designing, including the development of a new framework for audio-visual corpus creation. We identify the main research questions related to the speech corpus creation task and discuss them in detail in this paper. We also consider some main cases of usage that require speech recognition in a vehicle cabin for interaction with a driver monitoring system. We also consider other important use cases when the system detects dangerous states of driver's drowsiness and starts a question-answer game to prevent dangerous situations. At the end based on the proposed methodology, we developed a mobile application that allows us to record a corpus for the Russian language. We created RUSAVIC corpus using the developed mobile application that at the moment a unique audiovisual corpus for the Russian language that is recorded in-the-wild condition.

引用

页码：34986 / 35003

页数：18

共 50 条

[1] An audio-visual corpus for multimodal automatic speech recognition
Andrzej Czyzewski
Bozena Kostek
Piotr Bratoszewski
Jozef Kotus
Marcin Szykulski
Journal of Intelligent Information Systems, 2017, 49 : 167 - 192
[2] An audio-visual corpus for multimodal automatic speech recognition
Czyzewski, Andrzej
Kostek, Bozena
Bratoszewski, Piotr
Kotus, Jozef
Szykulski, Marcin
JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2017, 49 (02) : 167 - 192
[3] Indonesian Audio-Visual Speech Corpus for Multimodal Automatic Speech Recognition
Maulana, Muhammad Rizki Aulia Rahman
Fanany, Mohamad Ivan
2017 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND INFORMATION SYSTEMS (ICACSIS), 2017, : 381 - 385
[4] Building a data corpus for audio-visual speech recognition
Chitu, Alin G.
Rothkrantz, Leon J. M.
EUROMEDIA '2007, 2007, : 88 - 92
[5] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
Mroueh, Youssef
Marcheret, Etienne
Goel, Vaibhava
2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134
[6] An audio-visual corpus for speech perception and automatic speech recognition (L)
Cooke, Martin
Barker, Jon
Cunningham, Stuart
Shao, Xu
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2006, 120 (05): : 2421 - 2424
[7] Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition
Song, Qiya
Sun, Bin
Li, Shutao
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (12) : 10028 - 10038
[8] Multimodal Learning Using 3D Audio-Visual Data or Audio-Visual Speech Recognition
Su, Rongfeng
Wang, Lan
Liu, Xunying
2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 40 - 43
[9] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
Hwang, Jung-Wook
Park, Jeongkyun
Park, Rae-Hong
Park, Hyung-Min
APPLIED ACOUSTICS, 2023, 211
[10] Auxiliary Loss Multimodal GRU Model in Audio-Visual Speech Recognition
Yuan, Yuan
Tian, Chunlin
Lu, Xiaoqiang
IEEE ACCESS, 2018, 6 : 5573 - 5583

← 1 2 3 4 5 →