SlowFastFormer for 3D human pose estimation

Cited by: 5

Authors:

Zhou, Lu [1]

Chen, Yingying [1]

Wang, Jinqiao [1,2,3,4]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Fdn Model Res Ctr, Beijing 100190, Peoples R China

[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China

[3] Wuhan AI Res, Wuhan 430073, Peoples R China

[4] Peng Cheng Lab, Shenzhen 518066, Peoples R China

Source:

COMPUTER VISION AND IMAGE UNDERSTANDING | 2024 / Vol. 243

Funding:

National Natural Science Foundation of China;

Keywords:

SlowFastFormer; Transformer; Blending; 3D human pose estimation; Hierarchical supervision;

DOI:

10.1016/j.cviu.2024.103992

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

3D human pose estimation in videos aims at locating the human joints in 3D space given a temporal sequence. Motion information and skeleton context are two significant elements for pose estimation in videos. In this paper, we propose a SlowFastFormer (slow-fast transformer) network in which two branches with different input rates are composed to encode these two kinds of context. For the slow branch, skeleton context is well learned at a higher frame rate. For the fast branch, motion information is captured at a lower frame rate. Through these two branches, different kinds of context are encoded separately. We fuse these two branches at a later stage to fully utilize the skeleton context and motion information. Afterwards, a blending module is developed to promote message exchange among multiple branches. In the blending stage, different kinds of context information are exchanged and the feature representation is enhanced as a result. Lastly, a hierarchical supervision scheme is tailored in which predictions at different levels are inferred in a progressive manner. Our approach achieves competitive performance with lower computational complexity on several benchmarks, i.e., Human3.6M, MPI-INF-3DHP and HumanEva-I.
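The idea of feeding two branches at different input rates can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the stride values, and the toy clip are all assumptions made for illustration, and the branches themselves (transformers, fusion, blending) are not modeled. The sketch only shows how one pose sequence could be sampled densely and sparsely to give two views of the same clip.

```python
from typing import List, Sequence, Tuple

# One frame of 2D keypoints: an (x, y) pair per joint (hypothetical layout).
Pose2D = List[Tuple[float, float]]


def subsample(frames: Sequence[Pose2D], stride: int) -> List[Pose2D]:
    """Keep every `stride`-th frame, starting from the first one."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(frames[::stride])


def two_rate_inputs(
    frames: Sequence[Pose2D],
    dense_stride: int = 1,   # assumed value: keep all frames
    sparse_stride: int = 4,  # assumed value: keep one frame in four
) -> Tuple[List[Pose2D], List[Pose2D]]:
    """Return a (dense, sparse) pair of views of the same clip."""
    return subsample(frames, dense_stride), subsample(frames, sparse_stride)


# Toy clip: 9 frames with 2 joints each; coordinates are placeholders.
clip = [[(float(t), 0.0), (float(t), 1.0)] for t in range(9)]
dense, sparse = two_rate_inputs(clip)
```

With these assumed strides, the dense view keeps all 9 frames while the sparse view keeps frames 0, 4 and 8; each view would then be passed to its own branch before the later fusion step described in the abstract.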

Pages: 9

References: 70 in total

