Playing for 3D Human Recovery

Cited by: 1
Authors
Cai, Zhongang [1 ,2 ]
Zhang, Mingyuan [1 ]
Ren, Jiawei [1 ]
Wei, Chen [2 ]
Ren, Daxuan [1 ]
Lin, Zhengyu [2 ]
Zhao, Haiyu [2 ]
Yang, Lei [2 ]
Loy, Chen Change [1 ]
Liu, Ziwei [1 ]
Affiliations
[1] Nanyang Technol Univ, S Lab, Singapore 639798, Singapore
[2] Shanghai AI Lab, Shanghai 200240, Peoples R China
Keywords
Three-dimensional displays; Annotations; Synthetic data; Shape; Training; Parametric statistics; Solid modeling; Human pose and shape estimation; 3D human recovery; parametric humans; synthetic data; dataset
DOI
10.1109/TPAMI.2024.3450537
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image- and video-based 3D human recovery (i.e., pose and shape estimation) has achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences with automatically annotated 3D ground truths by playing a video game. Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. First, game-playing data is surprisingly effective: a simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin, and for video-based methods GTA-Human is even on par with the in-domain training set. Second, we discover that synthetic data provides critical complements to real data, which is typically collected indoors. Our investigation into the domain gap explains why our simple data mixture strategies are useful, offering new insights to the research community. Third, the scale of the dataset matters: the performance boost is closely tied to the amount of additional data, and a systematic study of key factors (such as camera angle and body pose) reveals that model performance is sensitive to data density. Fourth, the effectiveness of GTA-Human is also attributed to its rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. Fifth, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, on which a significant impact is also observed. We hope our work paves the way for scaling up 3D human recovery to the real world.
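The abstract does not specify how the real/synthetic data mixture is implemented. As a minimal hedged sketch, one common approach is to fill each training batch from the two pools at a fixed ratio; the function name `mix_batch` and the `synthetic_ratio` parameter are hypothetical illustrations, not the authors' actual pipeline:

```python
import random

def mix_batch(real_pool, synthetic_pool, batch_size, synthetic_ratio=0.5, seed=0):
    """Draw a training batch mixing real and synthetic samples.

    synthetic_ratio controls the fraction of the batch drawn from the
    synthetic pool (e.g., GTA-Human-style data); the remainder comes
    from the real pool (e.g., indoor mocap data).
    """
    rng = random.Random(seed)
    n_syn = int(round(batch_size * synthetic_ratio))
    n_real = batch_size - n_syn
    batch = rng.sample(synthetic_pool, n_syn) + rng.sample(real_pool, n_real)
    rng.shuffle(batch)  # avoid a fixed synthetic-then-real ordering
    return batch

# Example: an 8-sample batch, half synthetic and half real.
real = [("real", i) for i in range(100)]
synthetic = [("syn", i) for i in range(100)]
batch = mix_batch(real, synthetic, batch_size=8, synthetic_ratio=0.5)
```

In practice the ratio itself would be tuned; the paper's second insight suggests the right mixture depends on how the synthetic data complements the (indoor) real data.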
Pages: 10533-10545
Page count: 13