Learning to lip read words by watching videos

Cited by: 41
Authors
Chung, Joon Son [1 ]
Zisserman, Andrew [1 ]
Affiliations
[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Lip reading; Lip synchronisation; Active speaker detection; Large vocabulary; Dataset; SPEECH; EXTRACTION; FEATURES;
DOI
10.1016/j.cviu.2018.02.001
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Our aim is to recognise the words being spoken by a talking face, given only the video but not the audio. Existing works in this area have focussed on trying to recognise a small number of utterances in controlled environments (e.g. digits and alphabets), partially due to the shortage of suitable datasets. We make three novel contributions: first, we develop a pipeline for fully automated data collection from TV broadcasts. With this we have generated a dataset with over a million word instances, spoken by over a thousand different people; second, we develop a two-stream convolutional neural network that learns a joint embedding between the sound and the mouth motions from unlabelled data. We apply this network to the tasks of audio-to-video synchronisation and active speaker detection; third, we train convolutional and recurrent networks that are able to effectively learn and recognise hundreds of words from this large-scale dataset. In lip reading and in speaker detection, we demonstrate results that exceed the current state-of-the-art on public benchmark datasets.
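The second contribution described in the abstract, a two-stream network that embeds audio and mouth motion in a shared space so that in-sync pairs lie close together, can be sketched as below. This is a minimal illustration in PyTorch under stated assumptions, not the authors' exact architecture: the class name TwoStreamSyncNet, all layer sizes, the contrastive loss, and the input shapes (5-frame mouth clips, matching MFCC windows) are illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamSyncNet(nn.Module):
    """Hypothetical two-stream audio-visual embedding network (a sketch)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Visual stream: a short clip of grayscale mouth crops,
        # stacked on the channel axis (5 frames here; an assumption).
        self.visual = nn.Sequential(
            nn.Conv2d(5, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Audio stream: an MFCC "image" (1 x time x coefficients)
        # covering the same temporal window as the video clip.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, frames, mfcc):
        # L2-normalised embeddings so Euclidean distance is a sync score.
        v = F.normalize(self.visual(frames), dim=1)
        a = F.normalize(self.audio(mfcc), dim=1)
        return v, a

def contrastive_loss(v, a, label, margin=1.0):
    """label = 1 for in-sync (genuine) pairs, 0 for off-sync (false) pairs."""
    d = F.pairwise_distance(v, a)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()

# Usage sketch: the audio-to-visual embedding distance scores synchronisation;
# a face whose mouth motion matches the audio yields a small distance, which
# also serves as an active-speaker signal when compared across faces.
frames = torch.randn(8, 5, 112, 112)   # batch of 5-frame mouth clips
mfcc = torch.randn(8, 1, 20, 13)       # matching MFCC windows
labels = torch.randint(0, 2, (8,)).float()
model = TwoStreamSyncNet()
v, a = model(frames, mfcc)
loss = contrastive_loss(v, a, labels)
loss.backward()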
Pages: 76-85
Number of pages: 10