Learning to lip read words by watching videos

Cited by: 41
Authors
Chung, Joon Son [1 ]
Zisserman, Andrew [1 ]
Affiliations
[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Lip reading; Lip synchronisation; Active speaker detection; Large vocabulary; Dataset; SPEECH; EXTRACTION; FEATURES;
DOI
10.1016/j.cviu.2018.02.001
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Our aim is to recognise the words being spoken by a talking face, given only the video but not the audio. Existing works in this area have focussed on trying to recognise a small number of utterances in controlled environments (e.g. digits and alphabets), partially due to the shortage of suitable datasets. We make three novel contributions: first, we develop a pipeline for fully automated data collection from TV broadcasts. With this we have generated a dataset with over a million word instances, spoken by over a thousand different people; second, we develop a two-stream convolutional neural network that learns a joint embedding between the sound and the mouth motions from unlabelled data. We apply this network to the tasks of audio-to-video synchronisation and active speaker detection; third, we train convolutional and recurrent networks that are able to effectively learn and recognise hundreds of words from this large-scale dataset. In lip reading and in speaker detection, we demonstrate results that exceed the current state-of-the-art on public benchmark datasets.
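The second contribution described in the abstract, a two-stream network that embeds audio and mouth motion in a shared space so that in-sync pairs lie close together, can be sketched as below. This is a minimal illustration in PyTorch under stated assumptions, not the authors' exact architecture: the class name TwoStreamSyncNet, all layer sizes, the contrastive loss, and the input shapes (5-frame mouth clips, matching MFCC windows) are illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamSyncNet(nn.Module):
    """Hypothetical two-stream audio-visual embedding network (a sketch)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Visual stream: a short clip of grayscale mouth crops,
        # stacked on the channel axis (5 frames here; an assumption).
        self.visual = nn.Sequential(
            nn.Conv2d(5, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Audio stream: an MFCC "image" (1 x time x coefficients)
        # covering the same temporal window as the video clip.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, frames, mfcc):
        # L2-normalised embeddings so Euclidean distance is a sync score.
        v = F.normalize(self.visual(frames), dim=1)
        a = F.normalize(self.audio(mfcc), dim=1)
        return v, a

def contrastive_loss(v, a, label, margin=1.0):
    """label = 1 for in-sync (genuine) pairs, 0 for off-sync (false) pairs."""
    d = F.pairwise_distance(v, a)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()

# Usage sketch: the audio-to-visual embedding distance scores synchronisation;
# a face whose mouth motion matches the audio yields a small distance, which
# also serves as an active-speaker signal when compared across faces.
frames = torch.randn(8, 5, 112, 112)   # batch of 5-frame mouth clips
mfcc = torch.randn(8, 1, 20, 13)       # matching MFCC windows
labels = torch.randint(0, 2, (8,)).float()
model = TwoStreamSyncNet()
v, a = model(frames, mfcc)
loss = contrastive_loss(v, a, labels)
loss.backward()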
Pages: 76-85
Number of pages: 10