Action recognition on unmanned aerial vehicles (UAVs) must cope with complex backgrounds and focus on small targets. Existing methods typically rely on auxiliary detectors to localize objects in each frame and feed the cropped object sequences into the network. However, this requires extra detection annotations for training, and the multi-stage pipeline increases the deployment burden on UAV terminals at inference time. Therefore, we propose a saliency-aware spatio-temporal network (SaStNet) for UAV-based action recognition in an end-to-end manner. Specifically, short-term and long-term motion information is captured progressively. For short-term modeling, a saliency-guided enhancement module learns attention scores that re-weight the original features aggregated from neighboring frames. For long-term modeling, informative regions are first adaptively aggregated by a saliency-guided aggregation module; a spatio-temporal decoupling attention mechanism then attends to spatially salient regions and captures temporal relationships across all frames. Integrating these modules into classical backbones encourages the network to focus on moving targets and reduces interference from background noise. Extensive experiments and ablation studies are conducted on the UAV-Human, Drone-Action, and Something-Something datasets. Compared to state-of-the-art methods, SaStNet achieves a 5.7% accuracy improvement on the UAV-Human dataset with 8-frame inputs.
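To make the short-term component concrete, below is a minimal PyTorch sketch of a saliency-guided enhancement step: features are aggregated from neighboring frames and re-weighted by learned attention scores. The specific operators (a depthwise temporal convolution as the neighbor aggregator, a 1x1 convolution plus sigmoid as the scorer, and the residual combination) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SaliencyGuidedEnhancement(nn.Module):
    """Sketch of short-term saliency-guided enhancement (assumed design).

    Input/output features have shape (B, T, C, H, W). Each frame is
    aggregated with its temporal neighbors, and the aggregated features
    are modulated by learned per-pixel saliency scores.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Depthwise temporal conv aggregates each frame with its
        # immediate neighbors (3-frame window is an assumption).
        self.temporal_agg = nn.Conv3d(
            channels, channels, kernel_size=(3, 1, 1),
            padding=(1, 0, 0), groups=channels, bias=False)
        # 1x1 conv predicts a saliency score in [0, 1] per position.
        self.scorer = nn.Sequential(
            nn.Conv3d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, T, C, H, W) -> (B, C, T, H, W) for 3D convolutions.
        x = x.permute(0, 2, 1, 3, 4)
        agg = self.temporal_agg(x)           # neighbor-aggregated features
        score = self.scorer(agg)             # learned attention scores
        out = x + score * agg                # residual saliency re-weighting
        return out.permute(0, 2, 1, 3, 4)    # back to (B, T, C, H, W)
```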
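For the long-term component, the sketch below illustrates one plausible reading of saliency-guided aggregation followed by spatio-temporal decoupling attention: each frame's pixels are pooled into a few informative region tokens, then self-attention is applied spatially within each frame and temporally across all frames. The token count `num_tokens`, the softmax-map pooling, and the use of `nn.MultiheadAttention` are assumptions for illustration; the paper's exact operators may differ.

```python
import torch
import torch.nn as nn

class DecoupledSpatioTemporalAttention(nn.Module):
    """Sketch of long-term modeling (assumed design): saliency-guided
    aggregation into region tokens, then decoupled spatial and temporal
    self-attention. `dim` must be divisible by `num_heads`.
    """

    def __init__(self, dim: int, num_tokens: int = 8, num_heads: int = 4):
        super().__init__()
        # Predicts per-frame saliency maps that pool pixels into tokens.
        self.region_maps = nn.Conv2d(dim, num_tokens, kernel_size=1)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        frames = x.reshape(B * T, C, H, W)
        # Saliency maps normalized over spatial positions: (B*T, N, H*W).
        maps = self.region_maps(frames).flatten(2).softmax(dim=-1)
        feats = frames.flatten(2)                        # (B*T, C, H*W)
        tokens = torch.bmm(maps, feats.transpose(1, 2))  # (B*T, N, C)

        # Spatial attention among salient region tokens within each frame.
        tokens, _ = self.spatial_attn(tokens, tokens, tokens)

        # Temporal attention: each region token attends across all frames.
        N = tokens.shape[1]
        t = tokens.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        t, _ = self.temporal_attn(t, t, t)
        return t.reshape(B, N, T, C).permute(0, 2, 1, 3)  # (B, T, N, C)
```

Decoupling the two attention passes keeps the cost linear in T for the spatial step and linear in N for the temporal step, rather than attending jointly over all T x H x W positions, which is what makes long-range modeling over full frame sequences tractable on resource-constrained UAV hardware.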