Audio-visual Multi-channel Recognition of Overlapped Speech

Cited by: 11
Authors
Yu, Jianwei [1 ,2 ]
Wu, Bo [2 ]
Gu, Rongzhi [2 ]
Zhang, Shi-Xiong [2 ]
Chen, Lianwu [2 ]
Xu, Yong [2 ]
Yu, Meng [2 ]
Su, Dan [2 ]
Yu, Dong [2 ]
Liu, Xunying [1 ]
Meng, Helen [1 ]
Affiliations
[1] The Chinese University of Hong Kong, Hong Kong, China
[2] Tencent AI Lab, Bellevue, WA 98004, USA
Source
INTERSPEECH 2020 | 2020
Keywords
Overlapped speech recognition; Speech separation; Audio-visual; Multi-channel; Neural networks; Models
DOI
10.21437/Interspeech.2020-2346
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on time-frequency (TF) masking, filter-and-sum, and mask-based minimum variance distortionless response (MVDR) beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion interpolating the CTC loss with a scale-invariant signal-to-noise ratio (Si-SNR) error cost. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed by simulation and by replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
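For reference, mask-based MVDR beamforming of the kind named in the abstract is commonly written in the following standard form; the notation here (masks m_s and m_n, multi-channel STFT vectors y(t,f), reference-microphone selector u) is our own shorthand, and the paper's exact variant may differ. Estimated TF masks weight the observed channel vectors to form speech and noise spatial covariance matrices, from which the beamforming weights and the separated output are derived:

\Phi_{ss}(f) = \frac{\sum_t m_s(t,f)\, \mathbf{y}(t,f)\, \mathbf{y}(t,f)^{\mathsf{H}}}{\sum_t m_s(t,f)}, \qquad \Phi_{nn}(f) \ \text{analogously with } m_n

\mathbf{w}(f) = \frac{\Phi_{nn}(f)^{-1}\, \Phi_{ss}(f)}{\operatorname{tr}\!\left(\Phi_{nn}(f)^{-1}\, \Phi_{ss}(f)\right)}\, \mathbf{u}, \qquad \hat{s}(t,f) = \mathbf{w}(f)^{\mathsf{H}}\, \mathbf{y}(t,f)

where u is a one-hot vector selecting the reference microphone.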
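The Si-SNR error cost and its interpolation with the CTC loss can likewise be sketched as below. This is a minimal PyTorch sketch assuming waveform-domain separation outputs; si_snr, joint_loss, and the weight lambda_sep are illustrative names and values, not the paper's actual implementation.

import torch

def si_snr(est, ref, eps=1e-8):
    # Scale-invariant SNR in dB; est/ref are (batch, samples) waveforms.
    est = est - est.mean(dim=-1, keepdim=True)  # remove DC offset
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference: the scale-invariant target.
    dot = (est * ref).sum(dim=-1, keepdim=True)
    s_target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10.0 * torch.log10(ratio + eps)

def joint_loss(ctc_loss, est_wave, ref_wave, lambda_sep=0.1):
    # Multi-task interpolation: minimise CTC loss while maximising Si-SNR.
    # lambda_sep is an illustrative weight, not taken from the paper.
    return (1.0 - lambda_sep) * ctc_loss - lambda_sep * si_snr(est_wave, ref_wave).mean()

Maximising Si-SNR (hence the negative sign) keeps the separation front-end anchored to signal-level quality while the CTC term optimises recognition accuracy, which is the mismatch-reduction idea the abstract describes.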
Pages: 3496-3500
Number of pages: 5