AUDIO-VISUAL MULTI-CHANNEL SPEECH SEPARATION, DEREVERBERATION AND RECOGNITION

被引：4

作者：

Li, Guinan ^{[1
]}

Yu, Jianwei ^{[1
,2
]}

Deng, Jiajun ^{[1
]}

Liu, Xunying ^{[1
]}

Meng, Helen ^{[1
]}

机构：

[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[2] Tencent AI Lab, Bellevue, WA USA

来源：

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022年

关键词：

Audio-visual; Speech separation; dereverberation and recognition;

D O I：

10.1109/ICASSP43922.2022.9747237

中图分类号：

O42 [声学];

学科分类号：

070206 ; 082403 ;

摘要：

Despite the rapid advance of automatic speech recognition (ASR) technologies, accurate recognition of cocktail party speech characterised by the interference from overlapping speakers, background noise and room reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, audio-visual speech enhancement techniques have been developed, although predominantly targeting overlapping speech separation and recognition tasks. In this paper, an audiovisual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all three stages of the system is proposed. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches based on DNN-WPE and spectral mapping respectively. The learning cost function mismatch between the separation and dereverberation models and their integration with the back-end recognition system is minimised using fine-tuning on the MSE and LF-MMI criteria. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline audio-visual multi-channel speech separation and recognition system containing no dereverberation module by a statistically significant word error rate (WER) reduction of 2.06 % absolute (8.77 % relative).

引用

页码：6042 / 6046

页数：5

共 50 条

[21] Multi-Stream Asynchrony Dynamic Bayesian Network model for audio-visual continuous speech recognition
Lv, Guoyun
Jiang, Dongmei
Zhao, Rongchun
Jiang, Xiaoyue
Sahli, H.
2007 14TH INTERNATIONAL WORKSHOP ON SYSTEMS, SIGNALS, & IMAGE PROCESSING & EURASIP CONFERENCE FOCUSED ON SPEECH & IMAGE PROCESSING, MULTIMEDIA COMMUNICATIONS & SERVICES, 2007, : 170 - +
[22] The 'Audio-Visual Face Cover Corpus': Investigations into audio-visual speech and speaker recognition when the speaker's face is occluded by facewear
Fecher, Natalie
13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2247 - 2250
[23] DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation
Gogate, Mandar
Adeel, Ahsan
Marxer, Ricard
Barker, Jon
Hussain, Amir
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2723 - 2727
[24] An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Michelsanti, Daniel
Tan, Zheng-Hua
Zhang, Shi-Xiong
Xu, Yong
Yu, Meng
Yu, Dong
Jensen, Jesper
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 1368 - 1396
[25] Multi-channel separation of dynamic speech and sound events
Fujimura, Takuya
Scheibler, Robin
INTERSPEECH 2023, 2023, : 3749 - 3753
[26] A robust visual feature extraction based BTSM-LDA for audio-visual speech recognition
Lv, Guoyun
Zhao, Rongchun
Jiang, Dongmei
Li, Yan
Sahli, H.
2007 SECOND INTERNATIONAL CONFERENCE IN COMMUNICATIONS AND NETWORKING IN CHINA, VOLS 1 AND 2, 2007, : 1044 - +
[27] Improved Lite Audio-Visual Speech Enhancement
Chuang, Shang-Yi
Wang, Hsin-Min
Tsao, Yu
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1345 - 1359
[28] AUDIO-VISUAL SPEECH INPAINTING WITH DEEP LEARNING
Morrone, Giovanni
Michelsanti, Daniel
Tan, Zheng-Hua
Jensen, Jesper
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6653 - 6657
[29] A ROBUST AUDIO-VISUAL SPEECH ENHANCEMENT MODEL
Wang, Wupeng
Xing, Chao
Wang, Dong
Chen, Xiao
Sun, Fengyu
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7529 - 7533
[30] An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
Li, Kai
Xie, Fenghua
Chen, Hang
Yuan, Kexin
Hu, Xiaolin
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (10) : 6637 - 6651

← 1 2 3 4 5 →