Saliency-aware Spatio-temporal Modeling for Action Recognition on Unmanned Aerial Vehicles

Times Cited: 0
Authors
Sheng, Xiaoxiao [1 ]
Shen, Zhiqiang [1 ]
Xiao, Gang [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
Keywords
Autonomous aerial vehicles; Videos; Motion segmentation; Feature extraction; Adaptation models; Attention mechanisms; Training; Target recognition; Feeds; Drones; deep learning; action recognition; attention mechanism; unmanned aerial vehicles
DOI
10.1109/TLA.2024.10789633
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Action recognition on unmanned aerial vehicles (UAVs) must cope with complex backgrounds and small targets. Existing methods typically rely on an additional detector to extract objects in each frame and feed the cropped object sequence to the recognition network. However, training then requires extra detection annotations, and the multi-stage pipeline increases the deployment burden on UAV terminals at inference time. We therefore propose a saliency-aware spatio-temporal network (SaStNet) for end-to-end UAV-based action recognition. Short-term and long-term motion information is captured progressively. For short-term modeling, a saliency-guided enhancement module learns attention scores that reweight the original features aggregated from neighboring frames. For long-term modeling, informative regions are first adaptively concentrated by a saliency-guided aggregation module; a spatio-temporal decoupling attention mechanism then focuses on spatially salient regions and captures temporal relationships across all frames. Integrating these modules into classical backbones encourages the network to focus on moving targets and reduces interference from background noise. Extensive experiments and ablation studies are conducted on the UAV-Human, Drone-Action, and Something-Something datasets. Compared with state-of-the-art methods, SaStNet achieves a 5.7% accuracy improvement on the UAV-Human dataset using 8-frame inputs.
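The abstract gives no implementation details, but a minimal sketch may help illustrate the spatio-temporal decoupling idea it describes: spatial self-attention over salient regions within each frame, followed by temporal self-attention across frames. The module name, tensor layout, and PyTorch code below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a spatio-temporal decoupling attention block
# (assumes PyTorch; names and shapes are not taken from the paper).
import torch
import torch.nn as nn


class DecoupledSTAttention(nn.Module):
    """Spatial self-attention within each frame, then temporal
    self-attention across frames, on features of shape (B, T, N, C)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, c = x.shape
        # Spatial pass: attend over the N salient regions of each frame.
        xs = self.norm1(x).reshape(b * t, n, c)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, n, c)
        # Temporal pass: attend over the T frames for each region.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, n, t, c).permute(0, 2, 1, 3)
        return x


# Example usage on dummy features: 2 clips, 8 frames, 16 regions, 256 channels.
if __name__ == "__main__":
    feats = torch.randn(2, 8, 16, 256)
    block = DecoupledSTAttention(dim=256)
    print(block(feats).shape)  # torch.Size([2, 8, 16, 256])
```

Factorizing the two attention passes keeps the cost near O(T·N² + N·T²) instead of O((T·N)²) for joint attention, which is attractive for resource-limited UAV terminals.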
Pages: 1026 - 1033
Number of Pages: 8