Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition

Cited by: 4

Authors:

Hwang, Jung-Wook [1]

Park, Jeongkyun [2]

Park, Rae-Hong [1,3]

Park, Hyung-Min [1]

Affiliations:

[1] Sogang Univ, Dept Elect Engn, Seoul 04107, South Korea

[2] Sogang Univ, Dept Artificial Intelligence, Seoul 04107, South Korea

[3] Sogang Univ, ICT Convergence Disaster Safety Res Inst, Seoul 04107, South Korea

Source:

APPLIED ACOUSTICS | 2023 / Vol. 211

Funding:

National Research Foundation, Singapore;

Keywords:

Audio-visual speech recognition; Audio-visual speech enhancement; Deep learning; Joint training; Conformer; Robust speech recognition; DEREVERBERATION; NOISE;

DOI:

10.1016/j.apacoust.2023.109478

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Visual features are attractive cues for robust automatic speech recognition (ASR). In particular, in acoustically unfavorable environments, recognition performance can be improved by combining audio with visual information obtained from the speaker's face rather than using audio alone. For this reason, various audio-visual speech recognition (AVSR) models have recently been studied actively. However, experimental results of AVSR models indicate that the information important for speech recognition is concentrated mainly in the audio signal, while visual information serves to enhance the robustness of recognition when the audio signal is corrupted by noise. Therefore, the improvement that conventional AVSR models can achieve in noisy environments is limited. Unlike conventional AVSR models, which use the input audio-visual information directly, this paper proposes an AVSR model that first performs audio-visual speech enhancement (AVSE) to enhance the target speech based on audio-visual information and then uses both the audio information enhanced by the AVSE and visual information such as the speaker's lips or face. Specifically, we propose a deep AVSR model that is trained end-to-end as a single model by integrating a conformer-based AVSR model with hybrid decoding and an AVSE model based on the U-net with recurrent neural network (RNN) attention (RA). Experimental results on the LRS2-BBC and LRS3-TED datasets demonstrate that the AVSE model effectively suppresses corrupting noise and the AVSR model successfully achieves noise robustness. In particular, the proposed jointly trained model, which integrates the AVSE and AVSR stages into one model, showed better recognition performance than the other compared methods. © 2023 Elsevier Ltd. All rights reserved.
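The two-stage structure described in the abstract (enhance first, then recognize, with both stages trained under one objective) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the loss weights, and the reading of "hybrid decoding" as the common CTC/attention combination are all hypothetical, and the enhancement and recognition stages are passed in as plain callables rather than real networks.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class JointLossWeights:
    # Hypothetical weights; the abstract does not state how the losses are balanced.
    enhancement: float = 0.5  # weight of the AVSE (enhancement) loss
    ctc: float = 0.3          # CTC share inside an assumed CTC/attention hybrid


def hybrid_asr_loss(ctc_loss: float, attention_loss: float, ctc_weight: float) -> float:
    """Convex combination of CTC and attention losses (assumed form of the hybrid)."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


def joint_loss(enh_loss: float, ctc_loss: float, attention_loss: float,
               w: JointLossWeights) -> float:
    """Single objective covering both stages, so one backward pass could update both."""
    return w.enhancement * enh_loss + hybrid_asr_loss(ctc_loss, attention_loss, w.ctc)


def recognize_with_enhancement(noisy_audio: Sequence[float],
                               video: Sequence[float],
                               enhance: Callable,
                               recognize: Callable):
    """Data flow only: the recognizer sees enhanced audio plus the original video."""
    enhanced_audio = enhance(noisy_audio, video)  # AVSE stage uses both modalities
    return recognize(enhanced_audio, video)       # AVSR stage reuses the visual stream
```

In a real system the two callables would be the U-net-based AVSE network and the conformer-based AVSR network, and the scalar losses would be tensors produced by a deep learning framework; the sketch only shows how the stages and loss terms could fit together.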


Pages: 8

