Speech-Driven Gesture Generation Using Transformer-Based Denoising Diffusion Probabilistic Models

Cited by: 0
Authors
Wu, Bowen [1 ,2 ]
Liu, Chaoran [2 ,3 ,4 ]
Ishi, Carlos Toshinori [2 ,3 ]
Ishiguro, Hiroshi [3 ]
Affiliations
[1] Osaka Univ, Grad Sch Engn Sci, Osaka 5650871, Japan
[2] RIKEN, Guardian Robot Project, Kyoto 6190288, Japan
[3] ATR, Hiroshi Ishiguro Labs, Kyoto 6190288, Japan
[4] Natl Inst Informat, Res & Dev Ctr Large Language Models, Tokyo 1018430, Japan
Keywords
Diffusion models; Data models; Transformers; Feature extraction; Noise reduction; Avatars; Skeleton; Robots; Motion segmentation; Motion capture; Co-speech gesture; deep learning; gesture-based interaction; social interaction
DOI
10.1109/THMS.2024.3456085
CLC classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
While it is crucial for human-like avatars to perform co-speech gestures, existing approaches struggle to generate natural and realistic movements. In the present study, a novel transformer-based denoising diffusion model is proposed to generate co-speech gestures. Moreover, we introduce a practical sampling trick for diffusion models to maintain the continuity between the generated motion segments while improving the within-segment motion likelihood and naturalness. Our model can be used for online generation since it generates gestures for a short segment of speech, e.g., 2 s. We evaluate our model on two large-scale speech-gesture datasets with finger movements using objective measurements and a user study, showing that our model outperforms all other baselines. Our user study is based on the Metahuman platform in the Unreal Engine, a popular tool for creating human-like avatars and motions.
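The abstract mentions a sampling trick that keeps consecutive generated motion segments continuous while denoising each segment. As a rough illustration, the sketch below uses an inpainting-style approach in which the opening frames of each new segment are clamped to the tail of the previous segment at every denoising step. This is an assumption for illustration only, not the paper's exact method; the `denoise_step` placeholder stands in for the transformer denoiser, which in the real model would also be conditioned on speech features.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t):
    # Placeholder "denoiser": a real model would predict and remove noise
    # from (x, t, speech features). Here we simply shrink toward zero so
    # the sketch runs end to end.
    return x * 0.9

def sample_segment(n_frames, n_joints, prev_tail=None, overlap=4, steps=10):
    """Generate one motion segment of shape (n_frames, n_joints).

    If prev_tail is given, the first `overlap` frames are overwritten with
    the previous segment's last frames at every denoising step, an
    inpainting-style continuity constraint (our assumption).
    """
    x = rng.standard_normal((n_frames, n_joints))  # start from pure noise
    for t in range(steps):
        x = denoise_step(x, t)
        if prev_tail is not None:
            x[:overlap] = prev_tail  # enforce cross-segment continuity
    return x

seg1 = sample_segment(20, 6)
seg2 = sample_segment(20, 6, prev_tail=seg1[-4:], overlap=4)
# The second segment begins exactly where the first one ended.
assert np.allclose(seg2[:4], seg1[-4:])
```

Clamping inside the denoising loop, rather than only once at the end, lets the denoiser adapt the rest of the segment to the fixed boundary frames, which is what preserves within-segment naturalness alongside continuity.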
Pages: 733-742 (10 pages)