LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships for Zero-Shot Action Recognition

Cited by: 0
Authors
Sarma, Sandipan [1 ]
Singal, Divyam [1 ]
Sur, Arijit [1 ]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Comp Sci & Engn, Gauhati 781039, India
Source
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE | 2024
Keywords
Transformers; Visualization; Semantics; Context modeling; Adaptation models; Encoding; Spatiotemporal phenomena; Computational modeling; Training; Zero-shot learning; Action recognition; Transformer; Graph attention network
DOI
10.1109/TETCI.2024.3499995
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient "zero-shot" scene understanding. Pairing such models with transformers for temporal modeling has been rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely used benchmarks - UCF101, HMDB51, ActivityNet, and Kinetics - show that we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on UCF101 and HMDB51 in the conventional ZSAR setting and a 16.6% gain on UCF101 in the generalized setting. On the large-scale ActivityNet and Kinetics datasets, our method achieves relative gains of 31.8% and 27.9%, respectively, over previous methods. Additionally, we gain 25.3% and 18.4% on UCF101 and HMDB51 under the recent "TruZe" evaluation protocol.
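The abstract describes the architecture only at a high level. As a reading aid, below is a minimal sketch in PyTorch of how such a pipeline could be wired together, assuming CLIP-style 512-dimensional frame and class-text embeddings. All module names (LoCATeSketch, GATSketch), dilation rates, layer counts, and the fully connected class graph are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoCATeSketch(nn.Module):
    """Multi-scale dilated convolutions + temporal transformer (assumed design)."""

    def __init__(self, dim=512, dilations=(1, 2, 4), num_layers=2, num_heads=8):
        super().__init__()
        # Parallel dilated Conv1d branches capture local temporal context at several scales.
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations]
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):            # (B, T, dim) I-VL frame embeddings
        x = frame_feats.transpose(1, 2)        # (B, dim, T) for Conv1d
        x = sum(b(x) for b in self.branches) / len(self.branches)
        x = self.temporal(x.transpose(1, 2))   # temporal modeling over frames
        return x.mean(dim=1)                   # (B, dim) video embedding


class GATSketch(nn.Module):
    """One graph-attention layer over class text embeddings (assumed design)."""

    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, class_feats, adj):       # (C, dim) embeddings, (C, C) adjacency
        h = self.proj(class_feats)
        C = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(C, C, -1),
                           h.unsqueeze(0).expand(C, C, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))   # attend only to connected classes
        return torch.softmax(e, dim=-1) @ h          # (C, dim) refined class embeddings


if __name__ == "__main__":
    video = torch.randn(2, 16, 512)     # 2 clips, 16 frames, 512-d frame features
    classes = torch.randn(10, 512)      # 10 class-name text embeddings
    adj = torch.ones(10, 10)            # fully connected class graph (assumption)
    v = LoCATeSketch()(video)
    c = GATSketch()(classes, adj)
    scores = v @ c.t()                  # zero-shot similarity logits, shape (2, 10)

Here the parallel dilated Conv1d branches stand in for LoCATe's multi-scale local-context aggregation before temporal modeling, the attention layer over class embeddings stands in for the GAT, and the dot product between video and refined class embeddings yields zero-shot classification scores.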
Pages: 13