Still image action recognition based on interactions between joints and objects

Cited by: 4
Authors
Ashrafi, Seyed Sajad [1]
Shokouhi, Shahriar B. [1]
Ayatollahi, Ahmad [1]
Affiliations
[1] Iran Univ Sci & Technol IUST, Elect Engn Dept, Tehran, Iran
Keywords
Still image-based action recognition; Self-attention; Cross-attention; Convolutional neural networks (CNN); Atrous spatial pyramid pooling (ASPP)
DOI
10.1007/s11042-023-14350-z
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Still image-based action recognition is a challenging task in which an action must be recognized from a single input image. A common technique in this field is to use auxiliary information such as pose, objects, or background. However, the simultaneous use of several auxiliary components, and how best to combine them, has received less study. In this study, two cues, body joints and objects, are employed simultaneously, and an attention module is proposed to combine the features of these two components. The attention module consists of two self-attentions and a cross-attention, which are designed to account for the interactions between the objects, between the joints, and between the joints and objects, respectively. In addition, the Multi-scale Atrous Spatial Pyramid Pooling (MASPP) module is proposed to reduce the number of parameters of the proposed method and, at the same time, combine the features obtained from different levels of the backbone. The Joint Object Pooling (JOPool) module is proposed to extract local features from joint and object regions. ResNets are used as the backbone, and the stride of the last two layers is changed. Experimental results on different datasets show that combining several auxiliary components can increase the mean Average Precision (mAP) of recognition. The proposed method is evaluated on three important datasets, Stanford-40, PASCAL VOC 2012, and BU101PLUS, resulting in mAPs of 94.84%, 93.20%, and 91.25%, respectively. These mAPs are higher than those of the best previously proposed methods.
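The abstract does not give the internals of the attention module, so the following is only a minimal sketch of the general idea (two self-attentions followed by one cross-attention), not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, and it assumes that joints act as queries over objects in the cross-attention step. The names `attention` and `fuse_joints_and_objects`, and the toy feature vectors, are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of feature vectors.

    Each output row is a weighted average of the value rows, with
    weights given by the softmax of query-key similarity.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse_joints_and_objects(joints, objects):
    """Hypothetical wiring: two self-attentions, then one cross-attention."""
    joints_sa = attention(joints, joints, joints)        # joint-joint interaction
    objects_sa = attention(objects, objects, objects)    # object-object interaction
    # Assumption: joints query objects; the paper may wire this differently.
    joints_ca = attention(joints_sa, objects_sa, objects_sa)
    return joints_sa, objects_sa, joints_ca


# Toy inputs: 3 joint features and 2 object features, each of dimension 2.
joints = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
objects = [[0.5, 0.5], [2.0, 0.0]]
joints_sa, objects_sa, joints_ca = fuse_joints_and_objects(joints, objects)
```

In this sketch the cross-attention output has one row per joint, each a convex combination of the object features, which is one simple way to express a joint-object interaction.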
Pages: 25945-25971
Number of pages: 27