SlowFastFormer for 3D human pose estimation

Citations: 5
Authors
Zhou, Lu [1]
Chen, Yingying [1]
Wang, Jinqiao [1, 2, 3, 4]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Fdn Model Res Ctr, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Wuhan AI Res, Wuhan 430073, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
SlowFastFormer; Transformer; Blending; 3D human pose estimation; Hierarchical supervision
DOI
10.1016/j.cviu.2024.103992
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
3D human pose estimation in videos aims to locate human joints in 3D space given a temporal sequence. Motion information and skeleton context are two significant elements for pose estimation in videos. In this paper, we propose a SlowFastFormer (slow-fast transformer) network in which two branches with different input rates are combined to encode these two kinds of context. For the slow branch, skeleton context is well learned at a higher frame rate. For the fast branch, motion information is captured at a lower frame rate. Through these two branches, the different kinds of context are encoded separately. We fuse the two branches at a later stage to fully utilize the skeleton context and motion information. Afterwards, a blending module is developed to promote message exchange among multiple branches. In the blending stage, different kinds of context information are exchanged and the feature representation is enhanced as a result. Lastly, a hierarchical supervision scheme is tailored in which predictions at different levels are inferred in a progressive manner. Our approach achieves competitive performance with lower computational complexity on several benchmarks, i.e., Human3.6M, MPI-INF-3DHP and HumanEva-I.
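The idea of feeding one keypoint sequence to two branches at different input rates and fusing them later can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `two_rate_streams` and `fuse_streams`, the stride of 3, the 27-frame, 17-joint input, and the repeat-and-concatenate fusion are all assumptions chosen for illustration, and the transformer branches, blending module, and hierarchical supervision are omitted.

```python
import numpy as np

def two_rate_streams(keypoints, stride):
    """Split a (T, J, 2) keypoint sequence into a dense and a sparse stream.

    Hypothetical helper: the dense stream keeps every frame, the sparse
    stream keeps every `stride`-th frame.
    """
    dense = keypoints
    sparse = keypoints[::stride]
    return dense, sparse

def fuse_streams(dense, sparse, stride):
    """Align the sparse stream to the dense length and concatenate features.

    Assumed fusion for illustration only: each sparse frame is repeated
    `stride` times, trimmed to the dense length, and joined on the last axis.
    """
    upsampled = np.repeat(sparse, stride, axis=0)[: dense.shape[0]]
    return np.concatenate([dense, upsampled], axis=-1)

# Synthetic input: 27 frames, 17 joints, 2D coordinates (assumed sizes).
seq = np.random.default_rng(0).normal(size=(27, 17, 2))
dense, sparse = two_rate_streams(seq, stride=3)
fused = fuse_streams(dense, sparse, stride=3)
```

In this sketch each branch would process its own stream before fusion; here the raw streams are fused directly so that only the sampling and temporal alignment steps are visible.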
Pages: 9