Executing your Commands via Motion Diffusion in Latent Space

Cited by: 83
Authors
Chen, Xin [1 ]
Jiang, Biao [2 ]
Liu, Wen [1 ]
Huang, Zilong [1 ]
Fu, Bin [1 ]
Chen, Tao [2 ]
Yu, Gang [1 ]
Affiliations
[1] Tencent PCG, Shenyang, People's Republic of China
[2] Fudan University, Shanghai, People's Republic of China
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
DOI
10.1109/CVPR52729.2023.01726
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification
081104; 0812; 0835; 1405
Abstract
We study a challenging task, conditional human motion generation, which produces plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors. Because human motions are highly diverse and distributed quite differently from conditional modalities such as natural-language descriptions, it is hard to learn a probabilistic mapping from the desired conditional modality to human motion sequences. Moreover, raw motion data from motion capture systems can be redundant across frames and contain noise; directly modeling the joint distribution over raw motion sequences and conditional modalities would incur heavy computational overhead and may produce artifacts introduced by the capture noise. To learn a better representation of the diverse human motion sequences, we first design a powerful Variational AutoEncoder (VAE) that yields a representative, low-dimensional latent code for a human motion sequence. Then, instead of using a diffusion model to connect the raw motion sequences with the conditional inputs, we perform the diffusion process in the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) produces vivid motion sequences conforming to the given conditional inputs while substantially reducing computational overhead in both training and inference. Extensive experiments on various human motion generation tasks demonstrate that MLD achieves significant improvements over the state-of-the-art methods while running two orders of magnitude faster than previous diffusion models that operate on raw motion sequences.
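The two-stage pipeline the abstract describes (a motion VAE that compresses a sequence to a low-dimensional latent, then DDPM-style diffusion over that latent) can be sketched roughly as below. This is a minimal, illustrative PyTorch sketch under stated assumptions, not the authors' implementation: the paper's MLD uses transformer-based VAE and denoiser networks with frozen CLIP text features, whereas the MLP modules, the dimensions T, D, LATENT_DIM, COND_DIM, and the helper names (MotionVAE, LatentDenoiser, diffusion_training_step) here are all hypothetical.

import torch
import torch.nn as nn

# Hypothetical shapes, for illustration only: a clip of T frames, each a
# D-dimensional pose vector, compressed to a single LATENT_DIM latent code.
T, D, LATENT_DIM, COND_DIM = 196, 263, 256, 512

class MotionVAE(nn.Module):
    """Stage 1 (sketch): compress a motion sequence into a latent code."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(T * D, 1024), nn.ReLU())
        self.to_mu = nn.Linear(1024, LATENT_DIM)
        self.to_logvar = nn.Linear(1024, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 1024), nn.ReLU(), nn.Linear(1024, T * D))

    def encode(self, motion):                       # motion: (B, T, D)
        h = self.encoder(motion)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return z, mu, logvar

    def decode(self, z):                            # z: (B, LATENT_DIM)
        return self.decoder(z).view(-1, T, D)

class LatentDenoiser(nn.Module):
    """Stage 2 (sketch): predict the noise added to the latent, conditioned
    on a text embedding (e.g. from a frozen text encoder such as CLIP)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + COND_DIM + 1, 1024), nn.SiLU(),
            nn.Linear(1024, LATENT_DIM))

    def forward(self, z_t, t, cond):
        t = t.float().unsqueeze(-1) / 1000.0        # crude timestep embedding
        return self.net(torch.cat([z_t, t, cond], dim=-1))

def diffusion_training_step(vae, denoiser, motion, cond, alphas_cumprod):
    """One DDPM-style training step on the motion latent, not on raw frames."""
    with torch.no_grad():
        z0, _, _ = vae.encode(motion)               # diffuse in latent space
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],))
    a = alphas_cumprod[t].unsqueeze(-1)             # (B, 1)
    eps = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps      # forward noising of z0
    return nn.functional.mse_loss(denoiser(z_t, t, cond), eps)

# Usage with random stand-in data and a standard linear beta schedule:
# vae, denoiser = MotionVAE(), LatentDenoiser()
# alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
# loss = diffusion_training_step(vae, denoiser, torch.randn(4, T, D),
#                                torch.randn(4, COND_DIM), alphas_cumprod)

At inference one would sample z from a standard Gaussian, run the reverse diffusion chain with the denoiser and the text condition, and decode the final latent with vae.decode to obtain a motion sequence. Because each denoising step operates on a single low-dimensional latent rather than T x D raw frames, the per-step cost is far lower, which is the source of the speedup the abstract reports.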
Pages: 18000-18010
Page count: 11