Action understanding in videos is a challenging task that has attracted widespread attention in recent years. Most current methods localize bounding boxes of actors at the frame level and then track or link these detections across frames to form action tubes. These methods often focus on exploiting temporal context in videos while neglecting the importance of the detector itself. In this paper, we present an enhanced two-stream framework for action detection. Specifically, we devise appearance and motion detectors in a two-stream manner, which take k consecutive RGB frames and optical flow images as input, respectively. To improve feature representation capability, an anchor refinement sub-module with feature alignment is introduced into the two-stream architecture to generate flexible anchor cuboids. Meanwhile, a hierarchical fusion strategy is employed to concatenate intermediate feature maps, helping capture fast-moving subjects. Moreover, layer normalization with skip connections is adopted to reduce internal covariate shift between network layers, making training simple and effective. Compared with state-of-the-art methods, the proposed approach yields impressive performance gains on three prevailing datasets: UCF-Sports, UCF-101, and J-HMDB, confirming the effectiveness of our enhanced detector for action detection.
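
To make the normalization and input conventions concrete, below is a minimal PyTorch sketch of a layer-normalized block with a skip connection, fed with two-stream inputs. Everything here is an illustrative assumption: the class name `LayerNormSkipBlock`, the conv/norm/activation ordering, the channel-stacking convention for the k RGB frames (3k channels) and optical-flow images (2k channels), and the values of k and the spatial size are not specified by the abstract.

```python
import torch
import torch.nn as nn

class LayerNormSkipBlock(nn.Module):
    """Hypothetical block: conv -> LayerNorm -> skip addition -> ReLU.

    The abstract only states that layer normalization with skip
    connections is used to reduce internal covariate shift; the exact
    layer ordering and shapes here are assumptions for illustration.
    """

    def __init__(self, channels, height, width):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Normalize each sample over (C, H, W), independent of batch size.
        self.norm = nn.LayerNorm([channels, height, width])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Skip connection: add the input back after normalization.
        return self.act(x + self.norm(self.conv(x)))

# Assumed two-stream inputs: k consecutive RGB frames stacked along the
# channel axis (3k channels) and k optical-flow images (2k channels for
# x/y flow components); k = 5 and 64x64 resolution are arbitrary choices.
k = 5
rgb_stack = torch.randn(1, 3 * k, 64, 64)
flow_stack = torch.randn(1, 2 * k, 64, 64)
block_rgb = LayerNormSkipBlock(3 * k, 64, 64)
block_flow = LayerNormSkipBlock(2 * k, 64, 64)
print(block_rgb(rgb_stack).shape, block_flow(flow_stack).shape)
```

Unlike batch normalization, layer normalization computes statistics per sample, so the skip-connected block behaves identically at any batch size, which is one plausible reason such a design simplifies training.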