AUDIO-VISUAL MULTI-CHANNEL SPEECH SEPARATION, DEREVERBERATION AND RECOGNITION

Cited by: 4

Authors:

Li, Guinan [1]

Yu, Jianwei [1,2]

Deng, Jiajun [1]

Liu, Xunying [1]

Meng, Helen [1]

Affiliations:

[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[2] Tencent AI Lab, Bellevue, WA USA

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

Audio-visual; Speech separation; dereverberation and recognition;

DOI:

10.1109/ICASSP43922.2022.9747237

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Despite the rapid advance of automatic speech recognition (ASR) technologies, accurate recognition of cocktail party speech characterised by the interference from overlapping speakers, background noise and room reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, audio-visual speech enhancement techniques have been developed, although predominantly targeting overlapping speech separation and recognition tasks. In this paper, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all three stages of the system is proposed. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches based on DNN-WPE and spectral mapping respectively. The learning cost function mismatch between the separation and dereverberation models and their integration with the back-end recognition system is minimised using fine-tuning on the MSE and LF-MMI criteria. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline audio-visual multi-channel speech separation and recognition system containing no dereverberation module by a statistically significant word error rate (WER) reduction of 2.06% absolute (8.77% relative).
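The abstract quotes the WER gain both as an absolute and as a relative figure. The short sketch below shows how the two are related, assuming the usual definition (relative reduction = absolute reduction divided by the baseline WER). The implied baseline and proposed-system WERs it prints are back-calculated from the two quoted figures only; they are not values reported in this record, and the function names are illustrative.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, given two WERs in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def implied_baseline(absolute_reduction: float, relative_reduction_pct: float) -> float:
    """Baseline WER (percent) implied by an absolute and a relative reduction.

    Assumes relative = 100 * absolute / baseline.
    """
    return 100.0 * absolute_reduction / relative_reduction_pct


# Figures quoted in the abstract (percentage points and percent).
absolute = 2.06
relative = 8.77

# Back-calculated values: an inference, not numbers taken from the paper.
baseline = implied_baseline(absolute, relative)
proposed = baseline - absolute

print(f"implied baseline WER: {baseline:.2f}%")
print(f"implied proposed-system WER: {proposed:.2f}%")
print(f"relative reduction check: {relative_reduction(baseline, proposed):.2f}%")
```

Because the quoted figures are rounded, the back-calculated WERs are approximate.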

Pages: 6042 - 6046

Number of pages: 5

50 entries in total

[1] Audio-Visual End-to-End Multi-Channel Speech Separation, Dereverberation and Recognition
Li, Guinan
Deng, Jiajun
Geng, Mengzhe
Jin, Zengrui
Wang, Tianzi
Hu, Shujie
Cui, Mingyu
Meng, Helen
Liu, Xunying
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 2707 - 2723
[2] Audio-visual Multi-channel Recognition of Overlapped Speech
Yu, Jianwei
Wu, Bo
Gu, Rongzhi
Zhang, Shi-Xiong
Chen, Lianwu
Xu, Yong
Yu, Meng
Su, Dan
Yu, Dong
Liu, Xunying
Meng, Helen
INTERSPEECH 2020, 2020, : 3496 - 3500
[3] Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech
Yu, Jianwei
Zhang, Shi-Xiong
Wu, Bo
Liu, Shansong
Hu, Shoukang
Geng, Mengzhe
Liu, Xunying
Meng, Helen
Yu, Dong
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2067 - 2082
[4] Audio-Visual Speech Separation and Dereverberation With a Two-Stage Multimodal Network
Tan, Ke
Xu, Yong
Zhang, Shi-Xiong
Yu, Meng
Yu, Dong
IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2020, 14 (03) : 542 - 553
[5] DEEP AUDIO-VISUAL SPEECH SEPARATION WITH ATTENTION MECHANISM
Li, Chenda
Qian, Yanmin
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7314 - 7318
[6] A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection
Tamura, Satoshi
Ishikawa, Masato
Hashiba, Takashi
Takeuchi, Shin'ichi
Hayamizu, Satoru
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2702 - +
[7] Audio-Visual Multi-Talker Speech Recognition in A Cocktail Party
Wu, Yifei
Li, Chenda
Yang, Song
Wu, Zhongqin
Qian, Yanmin
INTERSPEECH 2021, 2021, : 3021 - 3025
[8] FaceFilter: Audio-visual speech separation using still images
Chung, Soo-Whan
Choe, Soyeon
Chung, Joon Son
Kang, Hong-Goo
INTERSPEECH 2020, 2020, : 3481 - 3485
[9] Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model
Martel, Hector
Richter, Julius
Li, Kai
Hu, Xiaolin
Gerkmann, Timo
INTERSPEECH 2023, 2023, : 1673 - 1677
[10] AN ANALYSIS OF SPEECH ENHANCEMENT AND RECOGNITION LOSSES IN LIMITED RESOURCES MULTI-TALKER SINGLE CHANNEL AUDIO-VISUAL ASR
Pasa, Luca
Morrone, Giovanni
Badino, Leonardo
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7309 - 7313