Pose-Appearance Relational Modeling for Video Action Recognition

Cited by: 12
Authors
Cui, Mengmeng [1 ]
Wang, Wei [1 ]
Zhang, Kunbo [1 ,2 ]
Sun, Zhenan [1 ,2 ]
Wang, Liang [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Ctr Res Intelligent Percept & Comp, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 101408, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Action recognition; 2D pose-appearance; relational modeling; temporal attention LSTM; ATTENTION NETWORK; LSTM;
DOI
10.1109/TIP.2022.3228156
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recent studies of video action recognition can be classified into two categories: appearance-based methods and pose-based methods. Appearance-based methods generally cannot model the temporal dynamics of large motions well, even with optical flow estimation, while pose-based methods ignore visual context information such as typical scenes and objects, which are also important cues for action understanding. In this paper, we tackle these problems by proposing a Pose-Appearance Relational Network (PARNet), which models the correlation between human pose and image appearance and combines the benefits of the two modalities to improve robustness on unconstrained real-world videos. Our model has three network streams: a pose stream, an appearance stream, and a relation stream. For the pose stream, a Temporal Multi-Pose RNN module is constructed to obtain dynamic representations through temporal modeling of 2D poses. For the appearance stream, a Spatial Appearance CNN module is employed to extract the global appearance representation of the video sequence. For the relation stream, a Pose-Aware RNN module is built to connect the pose and appearance streams by modeling action-sensitive visual context information. By jointly optimizing the three modules, PARNet achieves superior performance compared with state-of-the-art methods on both pose-complete datasets (KTH, Penn-Action, UCF11) and challenging pose-incomplete datasets (UCF101, HMDB51, JHMDB), demonstrating its robustness to complex environments and noisy skeletons. Its effectiveness on the NTU-RGBD dataset is also validated, even in comparison with 3D skeleton-based methods. Furthermore, an appearance-enhanced PARNet equipped with an RGB-based I3D stream is proposed, which outperforms Kinetics pre-trained competitors on UCF101 and HMDB51. These results verify the potential of our framework for integrating various modules.
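The abstract describes a three-stream design (pose, appearance, relation) whose outputs are combined for a final prediction. As a minimal illustrative sketch only (the abstract does not specify the fusion mechanism, so the late score fusion, class names, and weights below are assumptions, not the authors' implementation):

```python
# Hypothetical three-stream fusion in the spirit of PARNet (NOT the authors' code).
# Each stream is modeled as a callable mapping a video to per-class scores;
# the final prediction is the argmax of a weighted sum of the three score vectors.
from dataclasses import dataclass
from typing import Callable, List, Sequence

Scores = List[float]  # per-class scores from one stream


@dataclass
class ThreeStreamFusion:
    pose_stream: Callable[[object], Scores]        # stand-in for Temporal Multi-Pose RNN
    appearance_stream: Callable[[object], Scores]  # stand-in for Spatial Appearance CNN
    relation_stream: Callable[[object], Scores]    # stand-in for Pose-Aware RNN
    weights: Sequence[float] = (1.0, 1.0, 1.0)     # fusion weights (assumed equal)

    def predict(self, video: object) -> int:
        """Return the index of the class with the highest fused score."""
        per_stream = [s(video) for s in
                      (self.pose_stream, self.appearance_stream, self.relation_stream)]
        n_classes = len(per_stream[0])
        fused = [sum(w * scores[c] for w, scores in zip(self.weights, per_stream))
                 for c in range(n_classes)]
        return max(range(n_classes), key=fused.__getitem__)
```

In the actual paper the three modules are jointly optimized end to end; this sketch only shows how complementary per-stream evidence can be combined at inference time.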
Pages: 295-308 (14 pages)