Still image action recognition based on interactions between joints and objects

Cited by: 4

Authors:

Ashrafi, Seyed Sajad [1]

Shokouhi, Shahriar B. [1]

Ayatollahi, Ahmad [1]

Affiliation:

[1] Iran Univ Sci & Technol IUST, Elect Engn Dept, Tehran, Iran

Source:

MULTIMEDIA TOOLS AND APPLICATIONS | 2023 / Vol. 82 / Issue 17

Keywords:

Still image-based action recognition; Self-attention; Cross-attention; Convolutional neural networks (CNN); Atrous spatial pyramid pooling (ASPP)

DOI:

10.1007/s11042-023-14350-z

Chinese Library Classification:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Still image-based action recognition is a challenging area in which recognition is performed based on only a single input image. Utilizing auxiliary information such as pose, object, or background is one of the common techniques in this field. However, the simultaneous use of several auxiliary components and their optimal combination is less studied. In this study, two cues, body joints and objects, are employed simultaneously, and an attention module is proposed to combine the features of these two components. The attention module consists of two self-attentions and a cross-attention, which are designed to account for the interaction between the objects, between the joints, and between the joints and objects, respectively. In addition, a Multi-scale Atrous Spatial Pyramid Pooling (MASPP) module is proposed to reduce the number of parameters of the proposed method and, at the same time, combine the features obtained from different levels of the backbone. A Joint Object Pooling (JOPool) module is proposed to extract local features from joint and object regions. ResNets are used as the backbone, and the stride of the last two layers is changed. Experimental results on different datasets show that combining several auxiliary components can be effective in increasing the mean Average Precision (mAP) of recognition. The proposed method is evaluated on three important datasets, Stanford-40, PASCAL VOC 2012, and BU101PLUS, achieving mAPs of 94.84%, 93.20%, and 91.25%, respectively. The obtained mAPs are higher than those of the best previously proposed methods.
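
The abstract describes the attention module only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, assumes that in the cross-attention the joints act as queries over the objects, and uses hypothetical feature sizes and a hypothetical mean-pool-and-concatenate fusion step; none of these details are stated in the record.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key rows."""
    dim = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(dim), axis=-1)
    return weights @ values, weights

# Hypothetical sizes: 17 joints, 3 detected objects, 64-dimensional local features.
rng = np.random.default_rng(0)
num_joints, num_objects, dim = 17, 3, 64
joint_feats = rng.standard_normal((num_joints, dim))
object_feats = rng.standard_normal((num_objects, dim))

# Two self-attentions: joints attend to joints, objects attend to objects.
joint_self, _ = attention(joint_feats, joint_feats, joint_feats)
object_self, _ = attention(object_feats, object_feats, object_feats)

# One cross-attention (assumed direction): joints query the objects.
joint_cross, cross_weights = attention(joint_self, object_self, object_self)

# Assumed fusion: average each stream over its tokens and concatenate.
fused = np.concatenate(
    [joint_self.mean(axis=0), object_self.mean(axis=0), joint_cross.mean(axis=0)]
)
```

Each row of `cross_weights` is a distribution over the objects for one joint, which is one way to read the "interaction between the joints and objects" mentioned in the abstract.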


Pages: 25945-25971

Page count: 27

54 references in total

