Speech-Driven Gesture Generation Using Transformer-Based Denoising Diffusion Probabilistic Models

Cited by: 0
Authors
Wu, Bowen [1 ,2 ]
Liu, Chaoran [2 ,3 ,4 ]
Ishi, Carlos Toshinori [2 ,3 ]
Ishiguro, Hiroshi [3 ]
Affiliations
[1] Osaka Univ, Grad Sch Engn Sci, Osaka 5650871, Japan
[2] RIKEN, Guardian Robot Project, Kyoto 6190288, Japan
[3] ATR, Hiroshi Ishiguro Labs, Kyoto 6190288, Japan
[4] Natl Inst Informat, Res & Dev Ctr Large Language Models, Tokyo 1018430, Japan
Keywords
Diffusion models; Data models; Transformers; Feature extraction; Noise reduction; Avatars; Skeleton; Robots; Motion segmentation; Motion capture; Co-speech gesture; deep learning; gesture-based interaction; social interaction
DOI
10.1109/THMS.2024.3456085
CLC classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
While it is crucial for human-like avatars to perform co-speech gestures, existing approaches struggle to generate natural and realistic movements. In the present study, a novel transformer-based denoising diffusion model is proposed to generate co-speech gestures. Moreover, we introduce a practical sampling trick for diffusion models to maintain the continuity between the generated motion segments while improving the within-segment motion likelihood and naturalness. Our model can be used for online generation since it generates gestures for a short segment of speech, e.g., 2 s. We evaluate our model on two large-scale speech-gesture datasets with finger movements using objective measurements and a user study, showing that our model outperforms all other baselines. Our user study is based on the Metahuman platform in the Unreal Engine, a popular tool for creating human-like avatars and motions.
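The abstract mentions a sampling trick that keeps consecutive generated motion segments continuous while denoising each segment. As a rough illustration, the sketch below uses an inpainting-style approach in which the opening frames of each new segment are clamped to the tail of the previous segment at every denoising step. This is an assumption for illustration only, not the paper's exact method; the `denoise_step` placeholder stands in for the transformer denoiser, which in the real model would also be conditioned on speech features.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t):
    # Placeholder "denoiser": a real model would predict and remove noise
    # from (x, t, speech features). Here we simply shrink toward zero so
    # the sketch runs end to end.
    return x * 0.9

def sample_segment(n_frames, n_joints, prev_tail=None, overlap=4, steps=10):
    """Generate one motion segment of shape (n_frames, n_joints).

    If prev_tail is given, the first `overlap` frames are overwritten with
    the previous segment's last frames at every denoising step, an
    inpainting-style continuity constraint (our assumption).
    """
    x = rng.standard_normal((n_frames, n_joints))  # start from pure noise
    for t in range(steps):
        x = denoise_step(x, t)
        if prev_tail is not None:
            x[:overlap] = prev_tail  # enforce cross-segment continuity
    return x

seg1 = sample_segment(20, 6)
seg2 = sample_segment(20, 6, prev_tail=seg1[-4:], overlap=4)
# The second segment begins exactly where the first one ended.
assert np.allclose(seg2[:4], seg1[-4:])
```

Clamping inside the denoising loop, rather than only once at the end, lets the denoiser adapt the rest of the segment to the fixed boundary frames, which is what preserves within-segment naturalness alongside continuity.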
Pages: 733-742 (10 pages)