CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior

Cited by: 69

Authors:

Xing, Jinbo ^{[1]}

Xia, Menghan ^{[2]}

Zhang, Yuechen ^{[1]}

Cun, Xiaodong ^{[2]}

Wang, Jue ^{[2]}

Wong, Tien-Tsin ^{[1]}

Affiliations:

[1] The Chinese University of Hong Kong, Hong Kong, People's Republic of China

[2] Tencent AI Lab, Shenzhen, People's Republic of China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

Keywords:

DOI:

10.1109/CVPR52729.2023.01229

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness due to the highly ill-posed nature and scarcity of audiovisual data. Existing works typically formulate the cross-modal mapping into a regression task, which suffers from the regression-to-mean problem leading to over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook, which effectively promotes the vividness of the generated motions by reducing the cross-modal mapping uncertainty. The codebook is learned by self-reconstruction over real facial motions and thus embedded with realistic facial motion priors. Over the discrete motion space, a temporal autoregressive model is employed to sequentially synthesize facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Also, a user study further justifies our superiority in perceptual quality. Code and video demo are available at https://doubiiu.github.io/projects/codetalker.
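The "code query" idea in the abstract can be illustrated with a toy nearest-neighbour lookup. The sketch below is not the authors' implementation: it assumes a tiny hand-written codebook of 2-D vectors and squared Euclidean distance, and the names `nearest_code`, `quantize`, `toy_codebook`, and `toy_features` are hypothetical. It only shows how a continuous feature can be replaced by its closest entry in a finite codebook, so the output is always one of a fixed set of learned items rather than an averaged regression value.

```python
import math
from typing import List, Sequence, Tuple


def nearest_code(feature: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `feature`
    under squared Euclidean distance (an assumed distance measure)."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(feature, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(
    features: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> Tuple[List[int], List[List[float]]]:
    """Map each feature to its nearest code; return the indices and the codes."""
    indices = [nearest_code(f, codebook) for f in features]
    return indices, [list(codebook[i]) for i in indices]


# Toy data, purely illustrative (not taken from the paper).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_features = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

indices, quantized = quantize(toy_features, toy_codebook)
print(indices)    # [1, 2, 0]
print(quantized)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```

In the paper's setting the codebook is learned from real facial motions and the query is driven by speech through an autoregressive model; neither of those parts is modelled here.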

Pages: 12780-12790

Page count: 11
