Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation

Cited by: 3
Authors
He, Shan [1 ,2 ]
He, Haonan [2 ]
Yang, Shuo [2 ]
Wu, Xiaoyan [2 ]
Xia, Pengcheng [2 ]
Yin, Bing [2 ]
Liu, Cong [2 ]
Dai, Lirong [1 ]
Xu, Chang [3 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Peoples R China
[2] iFLYTEK Res, Hefei, Peoples R China
[3] Univ Sydney, Camperdown, NSW, Australia
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023) | 2023
Keywords
DOI
10.1109/ICCV51070.2023.01305
CLC Number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent audio2mesh-based methods have shown promise for speech-driven 3D facial animation. However, several intractable challenges remain. For example, data scarcity is inherent to the task because 4D data are difficult to collect, and current methods generally offer little control over the animated face. To this end, we propose a novel framework named Speech4Mesh that successively generates 4D talking-head data and trains an audio2mesh network on the reconstructed meshes. In our framework, we first reconstruct 4D talking-head sequences from monocular videos. To precisely capture talking-related variation on the face, we exploit the audio-visual alignment information in the video through a contrastive learning scheme. We can then train an audio2mesh network (e.g., FaceFormer) on the generated 4D data. To control the animated talking face, we encode speaking-unrelated factors (e.g., emotion) into an emotion embedding for manipulation. Finally, a differentiable renderer yields more accurate photometric details in the reconstruction and animation results. Experiments demonstrate that Speech4Mesh not only outperforms state-of-the-art reconstruction methods, especially on the lower face, but also achieves better animation performance, both perceptually and objectively, after pre-training on the synthesized data. We further verify that the proposed framework can explicitly control the emotion of the animated talking face.
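
The abstract names a contrastive learning scheme for capturing audio-visual alignment but does not spell out its formulation. The sketch below is a minimal, hypothetical PyTorch illustration of one plausible instantiation, a symmetric InfoNCE objective over time-aligned audio and visual features; the function name audio_visual_contrastive_loss, the temperature value, and the toy 256-dimensional features are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch, not the authors' released code: a symmetric InfoNCE
# loss that aligns per-frame audio embeddings with visual embeddings of the
# same frames, one plausible reading of the contrastive scheme in the abstract.
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    # audio_feats:  (N, D) audio embeddings (e.g., from a wav2vec 2.0 encoder)
    # visual_feats: (N, D) visual embeddings for the same N video frames
    audio = F.normalize(audio_feats, dim=-1)    # unit-norm per frame
    visual = F.normalize(visual_feats, dim=-1)

    # Cosine-similarity matrix scaled by temperature; the diagonal holds the
    # positive (time-aligned) pairs, off-diagonal entries act as negatives.
    logits = audio @ visual.t() / temperature
    targets = torch.arange(audio.size(0), device=audio.device)

    # Symmetric cross-entropy: audio-to-visual and visual-to-audio retrieval.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)

if __name__ == "__main__":
    # Toy usage with random features standing in for real encoder outputs.
    a = torch.randn(8, 256)
    v = torch.randn(8, 256)
    print(audio_visual_contrastive_loss(a, v).item())

Minimizing such a loss pulls each audio frame toward the visual features of the same instant and away from other frames in the batch, which is how a reconstruction pipeline could latch onto talking-related facial variation rather than static appearance.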
Pages: 14146-14156
Number of pages: 11