Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows

Cited by: 132
Authors
Alexanderson, Simon [1 ]
Henter, Gustav Eje [1 ]
Kucherenko, Taras [1 ]
Beskow, Jonas [1 ]
Affiliations
[1] KTH Royal Inst Technol, Div Speech Mus & Hearing, Stockholm, Sweden
Funding
Swedish Research Council
Keywords
CCS Concepts: Computing methodologies -> Motion capture; Animation; Neural networks. Gestures; Motion capture; Data-driven animation; Character control; Probabilistic models; MODEL
DOI
10.1111/cgf.13946
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. As with human gesticulation, this yields a rich, natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and match the input speech well. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.
Pages: 487-496
Number of pages: 10