VisemeNet: Audio-Driven Animator-Centric Speech Animation

Cited by: 95
Authors
Zhou, Yang [1 ]
Xu, Zhan [1 ]
Landreth, Chris [2 ]
Kalogerakis, Evangelos [1 ]
Maji, Subhransu [1 ]
Singh, Karan [2 ]
Affiliations
[1] Univ Massachusetts, Amherst, MA 01003 USA
[2] Univ Toronto, Toronto, ON, Canada
Source
ACM TRANSACTIONS ON GRAPHICS | 2018 / Vol. 37 / No. 4
Funding
US National Science Foundation; Natural Sciences and Engineering Research Council of Canada;
Keywords
facial animation; neural networks;
DOI
10.1145/3197517.3201292
CLC Number
TP31 [Computer Software];
Discipline Code
081202; 0835;
Abstract
We present a novel deep-learning-based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig, directly from input audio. Our three-stage Long Short-Term Memory (LSTM) network architecture is motivated by psycho-linguistic insights: segmenting speech audio into a stream of phonetic groups is sufficient for viseme construction; speech styles like mumbling or shouting are strongly correlated with the motion of facial landmarks; and animator style is encoded in viseme motion curve profiles. Our contribution is an automatic, real-time lip-synchronization-from-audio solution that integrates seamlessly into existing animation pipelines. We evaluate our results by cross-validation against ground-truth data, animator critique and edits, visual comparison to recent deep-learning lip-synchronization solutions, and by showing our approach to be resilient to diversity in speaker and language.
Pages: 10
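The abstract outlines the model at a high level: a three-stage LSTM cascade that maps input audio to phonetic-group scores and facial-landmark motion, then combines both into viseme motion curves. As a rough illustration only, here is a minimal PyTorch sketch of such a cascade; every dimension, layer size, and the concatenation-based fusion scheme are illustrative assumptions, not the published VisemeNet architecture.

    import torch
    import torch.nn as nn

    class ThreeStageLSTMSketch(nn.Module):
        # Hypothetical sizes throughout; the paper's actual feature
        # dimensions, phoneme-group inventory, and viseme set are
        # not reproduced here.
        def __init__(self, audio_dim=65, hidden=256,
                     n_phoneme_groups=20, n_landmarks=76, n_visemes=20):
            super().__init__()
            # Stage 1: audio features -> per-frame phonetic-group scores
            self.phoneme_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
            self.phoneme_head = nn.Linear(hidden, n_phoneme_groups)
            # Stage 2: audio features -> facial-landmark motion (speech style)
            self.landmark_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
            self.landmark_head = nn.Linear(hidden, n_landmarks)
            # Stage 3: fuse both streams -> viseme motion-curve activations
            self.viseme_lstm = nn.LSTM(n_phoneme_groups + n_landmarks,
                                       hidden, batch_first=True)
            self.viseme_head = nn.Linear(hidden, n_visemes)

        def forward(self, audio):             # audio: (batch, frames, audio_dim)
            p, _ = self.phoneme_lstm(audio)
            phonemes = self.phoneme_head(p)   # (batch, frames, n_phoneme_groups)
            l, _ = self.landmark_lstm(audio)
            landmarks = self.landmark_head(l) # (batch, frames, n_landmarks)
            fused = torch.cat([phonemes, landmarks], dim=-1)
            v, _ = self.viseme_lstm(fused)
            curves = torch.sigmoid(self.viseme_head(v))  # curves in [0, 1]
            return phonemes, landmarks, curves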