MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

Cited by: 28
Authors
Zhang, Mingyuan [1 ]
Cai, Zhongang [1 ,2 ,3 ]
Pan, Liang [1 ]
Hong, Fangzhou [1 ]
Guo, Xinying [1 ]
Yang, Lei [2 ,3 ]
Liu, Ziwei [1 ]
Affiliations
[1] Nanyang Technological University, S-Lab, Singapore 639798, Singapore
[2] SenseTime Research, Shenzhen 518100, People's Republic of China
[3] Shanghai AI Lab, Shenzhen 518100, People's Republic of China
Keywords
Pipelines; Task analysis; Noise reduction; Transformers; Training; Probabilistic logic; Decoding; Conditional motion generation; diffusion model; motion synthesis; text-driven generation
DOI
10.1109/TPAMI.2024.3355414
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Human motion modeling is important for many modern graphics applications, yet it typically requires professional skills. To remove this skill barrier for laypeople, recent motion generation methods can directly generate human motions conditioned on natural-language descriptions. However, it remains challenging to achieve diverse and fine-grained motion generation from varied text inputs. To address this problem, we propose MotionDiffuse, one of the first diffusion-model-based text-driven motion generation frameworks, which demonstrates several desired properties over existing methods. 1) Probabilistic Mapping: instead of a deterministic language-motion mapping, MotionDiffuse generates motions through a series of denoising steps in which variations are injected. 2) Realistic Synthesis: MotionDiffuse excels at modeling complicated data distributions and generating vivid motion sequences. 3) Multi-Level Manipulation: MotionDiffuse responds to fine-grained instructions on body parts and supports arbitrary-length motion synthesis with time-varying text prompts. Our experiments show that MotionDiffuse outperforms existing state-of-the-art methods by convincing margins on text-driven motion generation and action-conditioned motion generation. A qualitative analysis further demonstrates MotionDiffuse's controllability for comprehensive motion generation.
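
As a rough illustration of the abstract's "series of denoising steps in which variations are injected," the following minimal NumPy sketch shows generic DDPM-style reverse denoising applied to a motion sequence. It is not MotionDiffuse's implementation: eps_model, text_emb, and the 60-frame x 263-feature motion shape (a HumanML3D-style pose representation) are all hypothetical placeholders.

import numpy as np

# Hypothetical sketch of DDPM reverse sampling for motion generation;
# not the authors' code. eps_model(x_t, t, text_emb) is assumed to
# return predicted noise with the same shape as x_t.

def make_schedule(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def denoise(eps_model, text_emb, n_frames=60, pose_dim=263, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(n_steps)
    x = rng.standard_normal((n_frames, pose_dim))   # x_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        eps = eps_model(x, t, text_emb)             # text-conditioned noise prediction
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise        # fresh noise injects variation
    return x                                        # x_0: generated motion sequence

# Toy run with a stand-in model that predicts zero noise:
motion = denoise(lambda x, t, c: np.zeros_like(x), text_emb=None, n_steps=50)
print(motion.shape)  # (60, 263)

Because each step re-injects Gaussian noise, repeated sampling with the same prompt yields different motions, which is the probabilistic-mapping property the abstract highlights.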
Pages: 4115-4128
Page count: 14