A Spatio-temporal Transformer for 3D Human Motion Prediction

Cited by: 150

Authors:

Aksan, Emre [1]

Kaufmann, Manuel [1]

Cao, Peng [2,3]

Hilliges, Otmar [1]

Affiliations:

[1] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland

[2] MIT, Cambridge, MA 02139 USA

[3] Peking Univ, Beijing, Peoples R China

Source:

2021 INTERNATIONAL CONFERENCE ON 3D VISION (3DV 2021) | 2021

DOI:

10.1109/3DV53792.2021.00066

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We propose a novel Transformer-based architecture for the task of generative modelling of 3D human motion. Previous work commonly relies on RNN-based models considering shorter forecast horizons reaching a stationary and often implausible state quickly. Recent studies show that implicit temporal representations in the frequency domain are also effective in making predictions for a predetermined horizon. Our focus lies on learning spatio-temporal representations autoregressively and hence generation of plausible future developments over both short and long term. The proposed model learns high dimensional embeddings for skeletal joints and how to compose a temporally coherent pose via a decoupled temporal and spatial self-attention mechanism. Our dual attention concept allows the model to access current and past information directly and to capture both the structural and the temporal dependencies explicitly. We show empirically that this effectively learns the underlying motion dynamics and reduces error accumulation over time observed in auto-regressive models. Our model is able to make accurate short-term predictions and generate plausible motion sequences over long horizons. We make our code publicly available at https://github.com/eth-ait/motion-transformer.

Pages: 565-574

Page count: 10

