Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

Cited by: 17
Authors
Serdyuk, Dmitriy [1]
Braga, Otavio [1]
Siohan, Olivier [1]
Affiliations
[1] Google, 111 8th Ave, New York, NY 10011 USA
Source
INTERSPEECH 2022 | 2022
Keywords
Audio-visual speech recognition; lip reading; video transformer; deep learning
DOI
10.21437/Interspeech.2022-10920
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks [1] demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our best video-only model obtains 34.9% WER on YTDEV18 and 19.3% on LRS3-TED, relative improvements of 10% and 9%, respectively, over our convolutional baseline. After fine-tuning, our model achieves state-of-the-art audio-visual recognition performance on LRS3-TED (1.6% WER). In addition, in a series of experiments on multi-person AV-ASR, we obtain an average relative WER reduction of 2% over our convolutional video front-end.
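The abstract reports results as word error rate (WER) and as relative improvement over a baseline. As a reading aid, the sketch below shows how these two quantities are conventionally computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_improvement` are hypothetical, the example sentences and WER values are made up for illustration, and the sketch assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, model_wer: float) -> float:
    """Fraction by which model_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative values only; these are not figures from the paper.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(relative_improvement(40.0, 36.0))
```

Under this definition, a drop from a hypothetical 40.0% WER to 36.0% WER is a 10% relative improvement, although the absolute difference is 4 percentage points.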
Pages: 2833-2837
Page count: 5
Related papers
34 in total
[1]  
Afouras T., 2018, arXiv preprint arXiv:1809.00496
[2]   Deep Audio-Visual Speech Recognition [J].
Afouras, Triantafyllos ;
Chung, Joon Son ;
Senior, Andrew ;
Vinyals, Oriol ;
Zisserman, Andrew .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12) :8717-8727
[3]   My lips are concealed: Audio-visual speech enhancement through obstructions [J].
Afouras, Triantafyllos ;
Chung, Joon Son ;
Zisserman, Andrew .
INTERSPEECH 2019, 2019, :4295-4299
[4]  
[Anonymous], CVPR
[5]   ViViT: A Video Vision Transformer [J].
Arnab, Anurag ;
Dehghani, Mostafa ;
Heigold, Georg ;
Sun, Chen ;
Lucic, Mario ;
Schmid, Cordelia .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :6816-6826
[6]  
Assael Yannis M, 2016, arXiv:1611.01599
[7]  
Bahdanau D, 2016, arXiv:1409.0473
[8]  
Bertasius G., 2021, arXiv
[9]  
Braga Otavio, 2021, ICASSP
[10]  
Braga Otavio, 2020, ICASSP