You Said That?: Synthesising Talking Faces from Audio

Cited by: 114
Authors
Jamaludin, Amir [1 ]
Chung, Joon Son [1 ]
Zisserman, Andrew [1 ]
Affiliation
[1] Univ Oxford, Dept Engn Sci, Parks Rd, Oxford OX1 3PJ, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Computer vision; Machine learning; Visual speech synthesis; Video synthesis;
DOI
10.1007/s11263-019-01150-y
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
We describe a method for generating a video of a talking face. The method takes still images of the target face and an audio speech segment as inputs, and generates a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we develop an encoder-decoder convolutional neural network (CNN) model that uses a joint embedding of the face and audio to generate synthesised talking face video frames. The model is trained on unlabelled videos using cross-modal self-supervision. We also propose methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.
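The abstract describes an encoder-decoder CNN that fuses a face-identity embedding with an audio embedding to synthesise each output frame. The sketch below illustrates only that data flow (encode face, encode audio, concatenate into a joint embedding, decode a frame); it is not the paper's architecture. The dimensions, the single linear projections standing in for the convolutional encoders/decoder, and the function name are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper): a 112x112 grayscale
# face still and an audio segment of 12 MFCC coefficients over 35 frames.
FACE_DIM = 112 * 112
AUDIO_DIM = 12 * 35
EMBED_DIM = 256

# Stand-in "encoders": random linear projections in place of trained
# convolutional encoders, used only to show the shapes and the fusion step.
W_face = rng.standard_normal((EMBED_DIM, FACE_DIM)) * 0.01
W_audio = rng.standard_normal((EMBED_DIM, AUDIO_DIM)) * 0.01
# Stand-in "decoder": maps the joint embedding back to one face frame.
W_dec = rng.standard_normal((FACE_DIM, 2 * EMBED_DIM)) * 0.01

def synthesise_frame(face_still, audio_segment):
    """Encode identity and speech, fuse the embeddings, decode one frame."""
    z_face = np.tanh(W_face @ face_still.ravel())       # identity embedding
    z_audio = np.tanh(W_audio @ audio_segment.ravel())  # speech embedding
    z_joint = np.concatenate([z_face, z_audio])         # joint embedding
    return (W_dec @ z_joint).reshape(112, 112)          # synthesised frame

frame = synthesise_frame(rng.standard_normal((112, 112)),
                         rng.standard_normal((12, 35)))
print(frame.shape)
```

In the actual model one such frame is generated per audio window and the frames are assembled into the output video; here a single call shows that the decoder's input is the concatenation of the two modality embeddings.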
Pages: 1767–1779
Page count: 13