CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior

Cited by: 69
Authors
Xing, Jinbo [1]
Xia, Menghan [2]
Zhang, Yuechen [1]
Cun, Xiaodong [2]
Wang, Jue [2]
Wong, Tien-Tsin [1]
Affiliations
[1] Chinese University of Hong Kong, Hong Kong, People's Republic of China
[2] Tencent AI Lab, Shenzhen, People's Republic of China
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01229
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness due to the highly ill-posed nature and scarcity of audiovisual data. Existing works typically formulate the cross-modal mapping into a regression task, which suffers from the regression-to-mean problem leading to over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook, which effectively promotes the vividness of the generated motions by reducing the cross-modal mapping uncertainty. The codebook is learned by self-reconstruction over real facial motions and thus embedded with realistic facial motion priors. Over the discrete motion space, a temporal autoregressive model is employed to sequentially synthesize facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Also, a user study further justifies our superiority in perceptual quality. Code and video demo are available at https://doubiiu.github.io/projects/codetalker.
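The "code query" idea in the abstract can be illustrated with a minimal sketch of nearest-neighbour lookup in a finite codebook. This is not the authors' implementation: the function names, the toy 2-D codebook, and the use of squared Euclidean distance are all assumptions made for illustration, and the learned encoder, decoder, and autoregressive model described in the abstract are omitted.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Hypothetical helper: stands in for the "code query" step, where a
    continuous feature is replaced by an entry of a finite codebook.
    """
    indices = []
    for f in features:
        best = min(range(len(codebook)),
                   key=lambda k: squared_distance(f, codebook[k]))
        indices.append(best)
    return indices


def lookup(indices: List[int],
           codebook: List[Sequence[float]]) -> List[List[float]]:
    """Recover the quantized vectors from their codebook indices."""
    return [list(codebook[k]) for k in indices]


# Toy codebook with three 2-D entries (assumed values, not learned).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Toy per-frame features, e.g. what an encoder might produce.
features = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

indices = quantize(features, codebook)
quantized = lookup(indices, codebook)
print(indices)    # [1, 2, 0]
print(quantized)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```

Because every output is drawn from the finite codebook rather than regressed as a continuous value, the result cannot collapse to an average of plausible motions, which is the intuition behind the abstract's claim about avoiding over-smoothing.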
Pages: 12780-12790
Number of pages: 11