Pose-Appearance Relational Modeling for Video Action Recognition

Cited by: 12
Authors
Cui, Mengmeng [1 ]
Wang, Wei [1 ]
Zhang, Kunbo [1 ,2 ]
Sun, Zhenan [1 ,2 ]
Wang, Liang [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Ctr Res Intelligent Percept & Comp, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 101408, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Action recognition; 2D pose-appearance; relational modeling; temporal attention LSTM; ATTENTION NETWORK; LSTM;
DOI
10.1109/TIP.2022.3228156
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recent studies of video action recognition can be classified into two categories: appearance-based methods and pose-based methods. Appearance-based methods generally cannot model the temporal dynamics of large motions well, even with optical flow estimation, while pose-based methods ignore visual context information such as typical scenes and objects, which are also important cues for action understanding. In this paper, we tackle these problems by proposing a Pose-Appearance Relational Network (PARNet), which models the correlation between human pose and image appearance and combines the benefits of the two modalities to improve robustness on unconstrained real-world videos. Our model has three network streams: a pose stream, an appearance stream, and a relation stream. For the pose stream, a Temporal Multi-Pose RNN module is constructed to obtain dynamic representations through temporal modeling of 2D poses. For the appearance stream, a Spatial Appearance CNN module is employed to extract the global appearance representation of the video sequence. For the relation stream, a Pose-Aware RNN module is built to connect the pose and appearance streams by modeling action-sensitive visual context information. By jointly optimizing the three modules, PARNet achieves superior performance compared with state-of-the-art methods on both pose-complete datasets (KTH, Penn-Action, UCF11) and challenging pose-incomplete datasets (UCF101, HMDB51, JHMDB), demonstrating its robustness to complex environments and noisy skeletons. Its effectiveness on the NTU-RGBD dataset is also validated, even in comparison with 3D skeleton-based methods. Furthermore, an appearance-enhanced PARNet equipped with an RGB-based I3D stream is proposed, which outperforms Kinetics pre-trained competitors on UCF101 and HMDB51. These results verify the potential of our framework for integrating various modules.
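The abstract describes a three-stream design (pose, appearance, relation) whose outputs are combined for a final prediction. As a minimal illustrative sketch only (the abstract does not specify the fusion mechanism, so the late score fusion, class names, and weights below are assumptions, not the authors' implementation):

```python
# Hypothetical three-stream fusion in the spirit of PARNet (NOT the authors' code).
# Each stream is modeled as a callable mapping a video to per-class scores;
# the final prediction is the argmax of a weighted sum of the three score vectors.
from dataclasses import dataclass
from typing import Callable, List, Sequence

Scores = List[float]  # per-class scores from one stream


@dataclass
class ThreeStreamFusion:
    pose_stream: Callable[[object], Scores]        # stand-in for Temporal Multi-Pose RNN
    appearance_stream: Callable[[object], Scores]  # stand-in for Spatial Appearance CNN
    relation_stream: Callable[[object], Scores]    # stand-in for Pose-Aware RNN
    weights: Sequence[float] = (1.0, 1.0, 1.0)     # fusion weights (assumed equal)

    def predict(self, video: object) -> int:
        """Return the index of the class with the highest fused score."""
        per_stream = [s(video) for s in
                      (self.pose_stream, self.appearance_stream, self.relation_stream)]
        n_classes = len(per_stream[0])
        fused = [sum(w * scores[c] for w, scores in zip(self.weights, per_stream))
                 for c in range(n_classes)]
        return max(range(n_classes), key=fused.__getitem__)
```

In the actual paper the three modules are jointly optimized end to end; this sketch only shows how complementary per-stream evidence can be combined at inference time.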
Pages: 295-308 (14 pages)