A Tracking-Based Two-Stage Framework for Spatio-Temporal Action Detection

Times Cited: 1
Authors
Luo, Jing [1]
Yang, Yulin [1,2]
Liu, Rongkai [1]
Chen, Li [1]
Fei, Hongxiao [1]
Hu, Chao [3,4]
Shi, Ronghua [3]
Zou, You [5]
Affiliations
[1] Cent South Univ, Sch Comp, Changsha 410000, Peoples R China
[2] Hunan Hanma Technol Co Ltd, Changsha 410083, Peoples R China
[3] Cent South Univ, Sch Elect Informat, Changsha 410000, Peoples R China
[4] Cent South Univ, Hunan "14th Five-Year Plan" Res Base of Educ Sci (Educ Informatizat), Changsha 410083, Peoples R China
[5] Cent South Univ, Informat & Networking Ctr, Changsha 410083, Peoples R China
Keywords
artificial intelligence; computer vision; action detection; object tracking
DOI
10.3390/electronics13030479
Chinese Library Classification (CLC) Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Spatio-temporal action detection (STAD) is a widely studied task with numerous application scenarios, such as video surveillance and smart education. Current studies follow a localization-based two-stage detection paradigm, which exploits a person detector for action localization and a feature-processing model with a classifier for action classification. However, many issues arise from the imbalance between task settings and model complexity in STAD. First, the complexity of heavy offline person detectors adds to the inference overhead. Second, frame-level actor proposals are incompatible with the video-level feature aggregation and Region-of-Interest feature pooling used in action classification, which limits detection performance under diverse action motions and results in low detection accuracy. In this paper, we propose a tracking-based two-stage spatio-temporal action detection framework called TrAD. The key idea of TrAD is to build video-level consistency and reduce model complexity by generating action track proposals across multiple video frames instead of actor proposals in a single frame. In particular, we use tailored tracking to mimic how humans cognitively follow actions, and we take the captured motion trajectories as video-level proposals. We then integrate a proposal scaling method and a feature aggregation module into action classification to enhance feature pooling for the detected tracks. Evaluations on the AVA dataset demonstrate that TrAD achieves state-of-the-art performance with 29.7 mAP while reducing overall computation by 58% compared to SlowFast.
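The abstract outlines a two-stage flow: a tracker yields video-level track proposals (one box per frame), the proposals are scaled, features are pooled along each track and aggregated, and a classifier predicts the action. The following Python sketch only illustrates that flow under assumed interfaces; it is not the authors' TrAD implementation, and every name in it (Track, scale_track, pool_roi, detect_actions, and the tracker/backbone/classifier callables) is a hypothetical placeholder.

# Illustrative sketch of a tracking-based two-stage STAD pipeline
# (assumed structure; all names are hypothetical, not the TrAD code).
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Track:
    """A video-level action proposal: one box per frame, shape (T, 4) as [x1, y1, x2, y2]."""
    boxes: np.ndarray


def scale_track(track: Track, factor: float = 1.2) -> Track:
    """Enlarge every box around its center; a stand-in for proposal scaling,
    so pooled regions cover the full spatial extent of the motion."""
    b = track.boxes
    cx, cy = (b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2
    w, h = (b[:, 2] - b[:, 0]) * factor, (b[:, 3] - b[:, 1]) * factor
    return Track(np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1))


def pool_roi(feat: np.ndarray, box: np.ndarray) -> np.ndarray:
    """Crude ROI pooling: average the (C, H, W) feature map inside the box
    (assumes box coordinates are already in feature-map space)."""
    _, h, w = feat.shape
    x1, y1, x2, y2 = box.astype(int)
    x1, y1 = min(max(x1, 0), w - 1), min(max(y1, 0), h - 1)
    x2, y2 = min(max(x2, x1 + 1), w), min(max(y2, y1 + 1), h)
    return feat[:, y1:y2, x1:x2].mean(axis=(1, 2))


def detect_actions(frames: np.ndarray,
                   tracker: Callable[[np.ndarray], List[Track]],
                   backbone: Callable[[np.ndarray], np.ndarray],
                   classifier: Callable[[np.ndarray], np.ndarray]) -> List[dict]:
    """Stage 1: track persons across the clip to get video-level proposals.
    Stage 2: pool clip features along each track, aggregate, and classify."""
    tracks = [scale_track(t) for t in tracker(frames)]   # track proposals, not per-frame boxes
    feats = backbone(frames)                             # (T, C, H, W) clip feature maps
    results = []
    for track in tracks:
        per_frame = [pool_roi(feats[t], box) for t, box in enumerate(track.boxes)]
        clip_feat = np.mean(per_frame, axis=0)           # simple temporal aggregation
        results.append({"track": track, "scores": classifier(clip_feat)})
    return results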
Pages: 14
References
42 in total
[1] Bertasius G, 2021, PR MACH LEARN RES, V139.
[2] Carion, Nicolas; Massa, Francisco; Synnaeve, Gabriel; Usunier, Nicolas; Kirillov, Alexander; Zagoruyko, Sergey. End-to-End Object Detection with Transformers. COMPUTER VISION - ECCV 2020, PT I, 2020, 12346: 213-229.
[3] Chen, Chongqing; Han, Dezhi; Chang, Chin-Chen. MPCCT: Multimodal vision-language learning paradigm with context-based compact Transformer. PATTERN RECOGNITION, 2024, 147.
[4] Chen, Chongqing; Han, Dezhi; Shen, Xiang. CLVIN: Complete language-vision interaction network for visual question answering. KNOWLEDGE-BASED SYSTEMS, 2023, 275.
[5] Chen, Shoufa; Sun, Peize; Xie, Enze; Ge, Chongjian; Wu, Jiannan; Ma, Lan; Shen, Jiajun; Luo, Ping. Watch Only Once: An End-to-End Video Action Detection Framework. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 8158-8167.
[6] Dave, Ishan; Scheffer, Zacchaeus; Kumar, Akash; Shiraz, Sarah; Rawat, Yogesh Singh; Shah, Mubarak. GabriellaV2: Towards better generalization in surveillance videos for Action Detection. 2022 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW 2022), 2022: 122-132.
[7] Feichtenhofer, Christoph; Fan, Haoqi; Malik, Jitendra; He, Kaiming. SlowFast Networks for Video Recognition. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 6201-6210.
[8] Girdhar Rohit, 2018, CoRR.
[9] Girshick, Ross. Fast R-CNN. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015: 1440-1448.
[10] Gkioxari G, 2015, PROC CVPR IEEE, P759, DOI 10.1109/CVPR.2015.7298676.