Improved SSD using deep multi-scale attention spatial-temporal features for action recognition

Cited by: 3
Authors
Zhou, Shuren [1 ]
Qiu, Jia [1 ]
Solanki, Arun [2 ]
Affiliations
[1] Changsha Univ Sci & Technol, Sch Comp & Commun Engn, Changsha 410114, Peoples R China
[2] Gautam Buddha Univ, Sch Informat & Commun Technol, Noida, India
Keywords
Action recognition; Multi-scale spatial-temporal feature; Attention mechanism;
DOI
10.1007/s00530-021-00831-4
Chinese Library Classification (CLC)
TP [Automation technology; computer technology];
Discipline Code
0812
Abstract
The biggest difference between video-based and image-based action recognition is that the former has the extra dimension of time. Most deep-learning methods for action recognition either (1) use 3D convolutions to model temporal features, or (2) introduce an auxiliary temporal feature such as optical flow. However, 3D convolutional networks usually consume huge computational resources, while optical-flow extraction requires a tedious extra process with additional storage, and is usually modeled only over short-range temporal features. To construct temporal features better, in this paper we propose a multi-scale attention spatial-temporal feature network based on SSD. The whole long-range video sequence is divided into pieces and sparsely sampled; a self-attention mechanism then captures the relation between each frame and the sequence of frames sampled over the entire video, making the network attend to the representative frames in the sequence. Moreover, the attention mechanism assigns different weights to inter-frame relations representing different time scales, so as to reason about the contextual relations of actions in the time dimension. Our proposed method achieves competitive performance on two commonly used datasets: UCF101 and HMDB51.
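The sampling-plus-attention idea the abstract describes can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the authors' implementation: the helper names (`sparse_sample`, `self_attention`), the segment count, and the feature dimension are assumptions for illustration only.

```python
import numpy as np

def sparse_sample(num_frames, num_segments, rng=None):
    """Piecewise sparse sampling (assumed TSN-style): split the clip into
    equal segments and draw one frame index from each segment."""
    rng = rng or np.random.default_rng(0)
    seg_len = num_frames / num_segments
    return np.array([int(i * seg_len + rng.integers(0, max(1, int(seg_len))))
                     for i in range(num_segments)])

def self_attention(X):
    """Scaled dot-product self-attention over sampled frame features.
    X: (T, d) per-frame features; returns attention-weighted features (T, d)."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                 # pairwise frame relations (T, T)
    scores -= scores.max(axis=1, keepdims=True)   # softmax numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X                            # each frame re-weighted by its
                                                  # relation to the whole sequence

# Toy usage: sample 8 segments from a 64-frame clip, 16-dim frame features
idx = sparse_sample(num_frames=64, num_segments=8)
feats = np.random.default_rng(1).normal(size=(8, 16))
out = self_attention(feats)
print(idx.shape, out.shape)  # (8,) (8, 16)
```

One index is drawn per segment, so the sampled frames cover the full temporal range of the video at low cost; the attention output keeps one feature vector per sampled frame, with representative frames receiving higher weight.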
Pages: 2123-2131
Page count: 9
Cited References
51 total
[21] Kuehne H., 2011, IEEE International Conference on Computer Vision (ICCV), p. 2556. DOI: 10.1109/ICCV.2011.6126543
[22] Laptev I. On space-time interest points. International Journal of Computer Vision, 2005, 64(2-3): 107-123.
[23] Laptev I., Marszalek M., Schmid C., Rozenfeld B. Learning realistic human actions from movies. 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008: 3222-+.
[24] Li C., Zhong Q., Xie D., Pu S. Collaborative spatiotemporal feature learning for video action recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 7864-7873.
[25] Liu L., Shao L., Rockett P. Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition. Pattern Recognition, 2013, 46(7): 1810-1818.
[26] Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C.-Y., Berg A. C. SSD: Single Shot MultiBox Detector. Computer Vision - ECCV 2016, Part I, 2016, 9905: 21-37.
[27] Long M., Peng F., Li H. Separable reversible data hiding and encryption for HEVC video. Journal of Real-Time Image Processing, 2018, 14(1): 171-182.
[28] Mnih V., 2014, Advances in Neural Information Processing Systems, Vol. 27.
[29] Peng Y., Zhao Y., Zhang J. Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(3): 773-786.
[30] Piergiovanni A., 2018, arXiv:1810.01455.