Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications

Cited by: 4
Authors
Jeon, Sanghun [1 ]
Kim, Mun Sang [1 ]
Affiliations
[1] Gwangju Inst Sci & Technol GIST, Sch Integrated Technol, Ctr Healthcare Robot, Gwangju 61005, South Korea
Funding
National Research Foundation, Singapore;
Keywords
deep learning; audiovisual speech recognition; lipreading; multimodal interaction; edutainment; virtual aquarium; PERCEPTION;
DOI
10.3390/s22207738
Chinese Library Classification
O65 [Analytical Chemistry];
Discipline classification code
070302 ; 081704 ;
Abstract
Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user-system interaction. However, its application to real environments is limited by the various kinds of noise present there. In this study, a multimodal interaction system based on audio and visual information is proposed that makes speech-driven virtual aquarium systems robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and vision are concatenated and then classified. The signal-to-noise ratio of the proposed system was determined based on data from four types of noise environments. Furthermore, it was tested for accuracy and efficiency against existing single-mode strategies for extracting visual features and audio speech recognition. Its average recognition rate was 91.42% when only speech was used, and improved by 6.7 percentage points to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as cafes, museums, music halls, and kiosks.
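The fusion step described in the abstract (concatenating an audio-derived word vector with a vision-derived feature vector, then classifying the result) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vector sizes, the toy values, the command labels `feed` and `swim`, and the nearest-centroid classifier, which stands in for whatever classifier the paper actually uses.

```python
from math import dist  # Euclidean distance, Python 3.8+


def fuse(audio_vec, visual_vec):
    """Feature-level fusion: concatenate the two modality vectors."""
    return list(audio_vec) + list(visual_vec)


def fit_centroids(samples):
    """Compute one mean vector per label from (fused_vector, label) pairs."""
    sums, counts = {}, {}
    for vec, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(centroids, fused_vec):
    """Return the label whose centroid is closest to the fused vector."""
    return min(centroids, key=lambda label: dist(centroids[label], fused_vec))


# Hypothetical training data: 2-dim audio word vectors, 3-dim visual features.
train = [
    (fuse([0.9, 0.1], [0.8, 0.2, 0.1]), "feed"),
    (fuse([0.8, 0.2], [0.9, 0.1, 0.0]), "feed"),
    (fuse([0.1, 0.9], [0.1, 0.2, 0.9]), "swim"),
    (fuse([0.2, 0.8], [0.0, 0.1, 0.8]), "swim"),
]
centroids = fit_centroids(train)

# Noise makes the audio vector ambiguous; the visual part still carries signal.
noisy = fuse([0.5, 0.5], [0.85, 0.15, 0.05])
prediction = classify(centroids, noisy)
print(prediction)
```

In this toy case the audio half of the fused vector is equally far from both classes, so the decision is driven by the visual half, which is the intuition behind combining the two modalities under noise.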
Pages: 27
Related papers
50 in total
  • [1] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
    Hwang, Jung-Wook
    Park, Jeongkyun
    Park, Rae-Hong
    Park, Hyung-Min
    APPLIED ACOUSTICS, 2023, 211
  • [2] NTCD-TIMIT: A New Database and Baseline for Noise-robust Audio-visual Speech Recognition
    Abdelaziz, Ahmed Hussen
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3752 - 3756
  • [3] AUDIO-VISUAL DEEP LEARNING FOR NOISE ROBUST SPEECH RECOGNITION
    Huang, Jing
    Kingsbury, Brian
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7596 - 7599
  • [4] Indonesian Audio-Visual Speech Corpus for Multimodal Automatic Speech Recognition
    Maulana, Muhammad Rizki Aulia Rahman
    Fanany, Mohamad Ivan
    2017 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND INFORMATION SYSTEMS (ICACSIS), 2017, : 381 - 385
  • [5] Robust audio-visual speech recognition based on late integration
    Lee, Jong-Seok
    Park, Cheol Hoon
    IEEE TRANSACTIONS ON MULTIMEDIA, 2008, 10 (05) : 767 - 779
  • [6] Robust Audio-Visual Speech Recognition Based on Hybrid Fusion
    Liu, Hong
    Li, Wenhao
    Yang, Bing
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 7580 - 7586
  • [7] DEEP MULTIMODAL LEARNING FOR AUDIO-VISUAL SPEECH RECOGNITION
    Mroueh, Youssef
    Marcheret, Etienne
    Goel, Vaibhava
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 2130 - 2134
  • [8] An audio-visual corpus for multimodal automatic speech recognition
    Andrzej Czyzewski
    Bozena Kostek
    Piotr Bratoszewski
    Jozef Kotus
    Marcin Szykulski
    Journal of Intelligent Information Systems, 2017, 49 : 167 - 192
  • [9] An audio-visual corpus for multimodal automatic speech recognition
    Czyzewski, Andrzej
    Kostek, Bozena
    Bratoszewski, Piotr
    Kotus, Jozef
    Szykulski, Marcin
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2017, 49 (02) : 167 - 192
  • [10] Enhancing Quality and Accuracy of Speech Recognition System by Using Multimodal Audio-Visual Speech signal
    El Maghraby, Eslam E.
    Gody, Amr M.
    Farouk, M. Hesham
    ICENCO 2016 - 2016 12TH INTERNATIONAL COMPUTER ENGINEERING CONFERENCE (ICENCO) - BOUNDLESS SMART SOCIETIES, 2016, : 219 - 229