Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications

Cited by: 4

Authors:

Jeon, Sanghun [1]

Kim, Mun Sang [1]

Affiliations:

[1] Gwangju Inst Sci & Technol GIST, Sch Integrated Technol, Ctr Healthcare Robot, Gwangju 61005, South Korea

Source:

SENSORS | 2022 / Vol. 22 / Issue 20

Funding:

National Research Foundation, Singapore;

Keywords:

deep learning; audiovisual speech recognition; lipreading; multimodal interaction; edutainment; virtual aquarium; PERCEPTION;

DOI:

10.3390/s22207738

Chinese Library Classification (CLC) number:

O65 [Analytical Chemistry];

Discipline classification codes:

070302 ; 081704 ;

Abstract:

Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user-system interaction. However, its application to real environments is limited by the various kinds of noise found there. In this study, an audio- and visual-information-based multimodal interaction system is proposed that makes speech-driven virtual aquarium systems robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and vision are classified after concatenation. The signal-to-noise ratio of the proposed system was determined based on data from four types of noise environments. Furthermore, it was tested for accuracy and efficiency against existing single-mode strategies for extracting visual features and audio speech recognition. Its average recognition rate was 91.42% when only speech was used, and improved by 6.7 percentage points to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as cafes, museums, music halls, and kiosks.
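The concatenate-then-classify step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vector sizes, the command labels, the linear softmax classifier, and every numeric value below are hypothetical stand-ins chosen only to show how two modality vectors might be joined and scored.

```python
import math


def concatenate(audio_vec, visual_vec):
    """Fuse two modalities by joining their feature vectors end to end."""
    return list(audio_vec) + list(visual_vec)


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases, labels):
    """Score the fused vector with one linear layer and pick the best label."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical toy inputs (assumptions, not values from the paper).
audio_vec = [0.9, 0.1]         # stand-in for a word vector built from speech API output
visual_vec = [0.2, 0.8, 0.1]   # stand-in for features from a lipreading network
labels = ["feed", "swim"]      # hypothetical voice commands
weights = [
    [1.0, -1.0, -0.5, 0.5, 0.0],
    [-1.0, 1.0, 0.5, -0.5, 0.0],
]
biases = [0.0, 0.0]

fused = concatenate(audio_vec, visual_vec)
label, probs = classify(fused, weights, biases, labels)
print(label, [round(p, 3) for p in probs])
```

In a trained system the weights would be learned from data. The sketch only shows that the classifier sees a single vector whose length is the sum of the two modality vector lengths, so either modality can shift the decision when the other is degraded by noise.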


Pages: 27
