AUDIO-VISUAL MULTI-CHANNEL SPEECH SEPARATION, DEREVERBERATION AND RECOGNITION

Cited by: 4
Authors
Li, Guinan [1 ]
Yu, Jianwei [1 ,2 ]
Deng, Jiajun [1 ]
Liu, Xunying [1 ]
Meng, Helen [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[2] Tencent AI Lab, Bellevue, WA USA
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Audio-visual; speech separation; dereverberation and recognition
DOI
10.1109/ICASSP43922.2022.9747237
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Despite the rapid advance of automatic speech recognition (ASR) technologies, accurate recognition of cocktail party speech characterised by interference from overlapping speakers, background noise and room reverberation remains a highly challenging task to date. Motivated by the invariance of the visual modality to acoustic signal corruption, audio-visual speech enhancement techniques have been developed, although predominantly targeting overlapping speech separation and recognition tasks. In this paper, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all three stages of the system is proposed. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches based on DNN-WPE and spectral mapping respectively. The learning cost function mismatch between the separation and dereverberation models and their integration with the back-end recognition system is minimised using fine-tuning on the MSE and LF-MMI criteria. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline audio-visual multi-channel speech separation and recognition system containing no dereverberation module by a statistically significant word error rate (WER) reduction of 2.06% absolute (8.77% relative).
Pages: 6042-6046
Page count: 5
Related Papers
50 records in total
  • [31] MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
    Anwar, Mohamed
    Shi, Bowen
    Goswami, Vedanuj
    Hsu, Wei-Ning
    Pino, Juan
    Wang, Changhan
    INTERSPEECH 2023, 2023, : 4064 - 4068
  • [32] Multifactor fusion for audio-visual speaker recognition
    Chetty, Girija
    Tran, Dat
    LECTURE NOTES IN SIGNAL SCIENCE, INTERNET AND EDUCATION (SSIP'07/MIV'07/DIWEB'07), 2007, : 70 - +
  • [33] AUDIO-VISUAL RECOGNITION OF GOOSE FLOCKING BEHAVIOR
    Steen, Kim Arild
    Therkildsen, Ole Roland
    Green, Ole
    Karstoft, Henrik
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2013, 27 (07)
  • [34] My lips are concealed: Audio-visual speech enhancement through obstructions
    Afouras, Triantafyllos
    Chung, Joon Son
    Zisserman, Andrew
    INTERSPEECH 2019, 2019, : 4295 - 4299
  • [35] Statistical multimodal integration for audio-visual speech processing
    Nakamura, S
    IEEE TRANSACTIONS ON NEURAL NETWORKS, 2002, 13 (04): : 854 - 866
  • [36] Edged based Audio-Visual Speech enhancement demonstrator
    Chen, Song
    Gogate, Mandar
    Dashtipour, Kia
    Kirton-Wingate, Jasper
    Hussain, Adeel
    Doctor, Faiyaz
    Arslan, Tughrul
    Hussain, Amir
    INTERSPEECH 2024, 2024, : 2032 - 2033
  • [37] Cortical integration of audio-visual speech and non-speech stimuli
    Wyk, Brent C. Vander
    Ramsay, Gordon J.
    Hudac, Caitlin M.
    Jones, Warren
    Lin, David
    Klin, Ami
    Lee, Su Mei
    Pelphrey, Kevin A.
    BRAIN AND COGNITION, 2010, 74 (02) : 97 - 106
  • [38] Multi-Task Joint Learning for Embedding Aware Audio-Visual Speech Enhancement
    Wang, Chenxi
    Chen, Hang
    Du, Jun
    Yin, Baocai
    Pan, Jia
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 255 - 259
  • [39] Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention
    Xu, Xinmeng
    Wang, Yang
    Jia, Jie
    Chen, Binbin
    Li, Dejun
    INTERSPEECH 2022, 2022, : 971 - 975
  • [40] Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
    Ephrat, Ariel
    Mosseri, Inbar
    Lang, Oran
    Dekel, Tali
    Wilson, Kevin
    Hassidim, Avinatan
    Freeman, William T.
    Rubinstein, Michael
    ACM TRANSACTIONS ON GRAPHICS, 2018, 37 (04):