Audio-visual Multi-channel Recognition of Overlapped Speech

Cited by: 11
Authors
Yu, Jianwei [1 ,2 ]
Wu, Bo [2 ]
Gu, Rongzhi [2 ]
Zhang, Shi-Xiong [2 ]
Chen, Lianwu [2 ]
Xu, Yong [2 ]
Yu, Meng [2 ]
Su, Dan [2 ]
Yu, Dong [2 ]
Liu, Xunying [1 ]
Meng, Helen [1 ]
Affiliations
[1] The Chinese University of Hong Kong, Hong Kong, China
[2] Tencent AI Lab, Bellevue, WA 98004, USA
Source
INTERSPEECH 2020 | 2020
Keywords
Overlapped speech recognition; Speech separation; Audio-visual; Multi-channel; Neural networks; Models
DOI
10.21437/Interspeech.2020-2346
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end. A series of audio-visual multi-channel speech separation front-end components based on time-frequency (TF) masking, filter-and-sum, and mask-based minimum variance distortionless response (MVDR) beamforming approaches were developed. To reduce the error cost mismatch between the separation and recognition components, they were jointly fine-tuned using the connectionist temporal classification (CTC) loss function, or a multi-task criterion interpolating the CTC loss with a scale-invariant signal-to-noise ratio (Si-SNR) error cost. Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed by simulation and by replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
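For reference, mask-based MVDR beamforming of the kind named in the abstract is commonly written in the following standard form; the notation here (masks m_s and m_n, multi-channel STFT vectors y(t,f), reference-microphone selector u) is our own shorthand, and the paper's exact variant may differ. Estimated TF masks weight the observed channel vectors to form speech and noise spatial covariance matrices, from which the beamforming weights and the separated output are derived:

\Phi_{ss}(f) = \frac{\sum_t m_s(t,f)\, \mathbf{y}(t,f)\, \mathbf{y}(t,f)^{\mathsf{H}}}{\sum_t m_s(t,f)}, \qquad \Phi_{nn}(f) \ \text{analogously with } m_n

\mathbf{w}(f) = \frac{\Phi_{nn}(f)^{-1}\, \Phi_{ss}(f)}{\operatorname{tr}\!\left(\Phi_{nn}(f)^{-1}\, \Phi_{ss}(f)\right)}\, \mathbf{u}, \qquad \hat{s}(t,f) = \mathbf{w}(f)^{\mathsf{H}}\, \mathbf{y}(t,f)

where u is a one-hot vector selecting the reference microphone.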
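The Si-SNR error cost and its interpolation with the CTC loss can likewise be sketched as below. This is a minimal PyTorch sketch assuming waveform-domain separation outputs; si_snr, joint_loss, and the weight lambda_sep are illustrative names and values, not the paper's actual implementation.

import torch

def si_snr(est, ref, eps=1e-8):
    # Scale-invariant SNR in dB; est/ref are (batch, samples) waveforms.
    est = est - est.mean(dim=-1, keepdim=True)  # remove DC offset
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference: the scale-invariant target.
    dot = (est * ref).sum(dim=-1, keepdim=True)
    s_target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10.0 * torch.log10(ratio + eps)

def joint_loss(ctc_loss, est_wave, ref_wave, lambda_sep=0.1):
    # Multi-task interpolation: minimise CTC loss while maximising Si-SNR.
    # lambda_sep is an illustrative weight, not taken from the paper.
    return (1.0 - lambda_sep) * ctc_loss - lambda_sep * si_snr(est_wave, ref_wave).mean()

Maximising Si-SNR (hence the negative sign) keeps the separation front-end anchored to signal-level quality while the CTC term optimises recognition accuracy, which is the mismatch-reduction idea the abstract describes.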
Pages: 3496-3500
Number of pages: 5