LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships for Zero-Shot Action Recognition

Cited by: 0
Authors
Sarma, Sandipan [1 ]
Singal, Divyam [1 ]
Sur, Arijit [1 ]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Comp Sci & Engn, Gauhati 781039, India
Source
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE | 2024
Keywords
Transformers; Visualization; Semantics; Context modeling; Adaptation models; Encoding; Spatiotemporal phenomena; Computational modeling; Training; Zero-shot learning; Action recognition; Transformer; Graph attention network
DOI
10.1109/TETCI.2024.3499995
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient "zero-shot" scene understanding. Pairing such models with transformers for temporal modeling has been rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely used benchmarks - UCF101, HMDB51, ActivityNet, and Kinetics - show that we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on UCF101 and HMDB51 in the conventional ZSAR setting and a 16.6% gain on UCF101 in the generalized setting. On the large-scale ActivityNet and Kinetics datasets, our method achieves relative gains of 31.8% and 27.9%, respectively, over previous methods. Additionally, we gain 25.3% and 18.4% on UCF101 and HMDB51 under the recent "TruZe" evaluation protocol.
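The abstract describes the architecture only at a high level. As a reading aid, below is a minimal sketch in PyTorch of how such a pipeline could be wired together, assuming CLIP-style 512-dimensional frame and class-text embeddings. All module names (LoCATeSketch, GATSketch), dilation rates, layer counts, and the fully connected class graph are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoCATeSketch(nn.Module):
    """Multi-scale dilated convolutions + temporal transformer (assumed design)."""

    def __init__(self, dim=512, dilations=(1, 2, 4), num_layers=2, num_heads=8):
        super().__init__()
        # Parallel dilated Conv1d branches capture local temporal context at several scales.
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations]
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):            # (B, T, dim) I-VL frame embeddings
        x = frame_feats.transpose(1, 2)        # (B, dim, T) for Conv1d
        x = sum(b(x) for b in self.branches) / len(self.branches)
        x = self.temporal(x.transpose(1, 2))   # temporal modeling over frames
        return x.mean(dim=1)                   # (B, dim) video embedding


class GATSketch(nn.Module):
    """One graph-attention layer over class text embeddings (assumed design)."""

    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, class_feats, adj):       # (C, dim) embeddings, (C, C) adjacency
        h = self.proj(class_feats)
        C = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(C, C, -1),
                           h.unsqueeze(0).expand(C, C, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))   # attend only to connected classes
        return torch.softmax(e, dim=-1) @ h          # (C, dim) refined class embeddings


if __name__ == "__main__":
    video = torch.randn(2, 16, 512)     # 2 clips, 16 frames, 512-d frame features
    classes = torch.randn(10, 512)      # 10 class-name text embeddings
    adj = torch.ones(10, 10)            # fully connected class graph (assumption)
    v = LoCATeSketch()(video)
    c = GATSketch()(classes, adj)
    scores = v @ c.t()                  # zero-shot similarity logits, shape (2, 10)

Here the parallel dilated Conv1d branches stand in for LoCATe's multi-scale local-context aggregation before temporal modeling, the attention layer over class embeddings stands in for the GAT, and the dot product between video and refined class embeddings yields zero-shot classification scores.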
Pages: 13