MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

Cited by: 28
Authors
Zhang, Mingyuan [1 ]
Cai, Zhongang [1 ,2 ,3 ]
Pan, Liang [1 ]
Hong, Fangzhou [1 ]
Guo, Xinying [1 ]
Yang, Lei [2 ,3 ]
Liu, Ziwei [1 ]
Affiliations
[1] Nanyang Technological University, S-Lab, Singapore 639798, Singapore
[2] SenseTime Research, Shenzhen 518100, People's Republic of China
[3] Shanghai AI Lab, Shenzhen 518100, People's Republic of China
Keywords
Pipelines; Task analysis; Noise reduction; Transformers; Training; Probabilistic logic; Decoding; Conditional motion generation; diffusion model; motion synthesis; text-driven generation
DOI
10.1109/TPAMI.2024.3355414
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Human motion modeling is important for many modern graphics applications, yet it typically requires professional skills. To remove this skill barrier for laypeople, recent motion generation methods can directly generate human motions conditioned on natural-language descriptions. However, it remains challenging to achieve diverse and fine-grained motion generation from varied text inputs. To address this problem, we propose MotionDiffuse, one of the first diffusion-model-based text-driven motion generation frameworks, which demonstrates several desired properties over existing methods. 1) Probabilistic Mapping: instead of a deterministic language-motion mapping, MotionDiffuse generates motions through a series of denoising steps in which variations are injected. 2) Realistic Synthesis: MotionDiffuse excels at modeling complicated data distributions and generating vivid motion sequences. 3) Multi-Level Manipulation: MotionDiffuse responds to fine-grained instructions on body parts and supports arbitrary-length motion synthesis with time-varying text prompts. Our experiments show that MotionDiffuse outperforms existing state-of-the-art methods by convincing margins on text-driven motion generation and action-conditioned motion generation. A qualitative analysis further demonstrates MotionDiffuse's controllability for comprehensive motion generation.
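
As a rough illustration of the abstract's "series of denoising steps in which variations are injected," the following minimal NumPy sketch shows generic DDPM-style reverse denoising applied to a motion sequence. It is not MotionDiffuse's implementation: eps_model, text_emb, and the 60-frame x 263-feature motion shape (a HumanML3D-style pose representation) are all hypothetical placeholders.

import numpy as np

# Hypothetical sketch of DDPM reverse sampling for motion generation;
# not the authors' code. eps_model(x_t, t, text_emb) is assumed to
# return predicted noise with the same shape as x_t.

def make_schedule(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def denoise(eps_model, text_emb, n_frames=60, pose_dim=263, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(n_steps)
    x = rng.standard_normal((n_frames, pose_dim))   # x_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        eps = eps_model(x, t, text_emb)             # text-conditioned noise prediction
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise        # fresh noise injects variation
    return x                                        # x_0: generated motion sequence

# Toy run with a stand-in model that predicts zero noise:
motion = denoise(lambda x, t, c: np.zeros_like(x), text_emb=None, n_steps=50)
print(motion.shape)  # (60, 263)

Because each step re-injects Gaussian noise, repeated sampling with the same prompt yields different motions, which is the probabilistic-mapping property the abstract highlights.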
Pages: 4115-4128
Page count: 14