MLUG: Bootstrapping Language-Motion Pre-Training for Unified Motion-Language Understanding and Generation

Citations: 0
Authors
Luo, Hongliang [1]
Xi, Wei [1]
Tang, Daniel [2]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Xian 710049, Peoples R China
[2] Mind Bridge AI Ltd, Ottawa, ON K1S 5R5, Canada
Keywords
motion generation; language motion; unified models
DOI
10.3390/s24227354
Chinese Library Classification number
O65 [Analytical Chemistry]
Discipline classification code
070302; 081704
Abstract
In the realm of computer vision and animation, the generation of human motion from textual descriptions represents a frontier of significant challenge and potential. This paper introduces MLUG, a groundbreaking framework poised to transform motion synthesis by harnessing the power of vision-language pre-training techniques. MLUG addresses the nuanced challenge of creating semantically rich, physically plausible, and emotionally expressive human motions through a novel integration of a unimodal encoder with motion-text contrastive loss, a motion-grounded text encoder, a motion-grounded motion decoder, and a motion length predictor. These components work in concert to align textual descriptions with dynamic motion sequences, offering an innovative solution to the limitations of existing models in open-vocabulary motion generation and emotional expressiveness. Through extensive evaluations, MLUG demonstrates unparalleled effectiveness in generating realistic and diverse motions from a broad spectrum of textual inputs, setting a new benchmark in the field.
Pages: 13