Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition

Cited by: 97

Authors:

Wang, Peng [1]

Cao, Yuanzhouhan [2]

Shen, Chunhua [2,3]

Liu, Lingqiao [2]

Shen, Heng Tao [1]

Affiliations:

[1] Univ Queensland, Sch Informat Technol & Elect Engn, St Lucia, Qld 4072, Australia

[2] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5005, Australia

[3] Australian Ctr Robot Vis, Brisbane, Qld 4000, Australia

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2017 / Vol. 27 / No. 12

Funding:

Australian Research Council;

Keywords:

Action recognition; convolutional neural network (CNN); temporal pyramid pooling;

DOI:

10.1109/TCSVT.2016.2576761

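Chinese Library Classification:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification codes:

0808 ; 0809 ;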

Abstract:

Encouraged by the success of convolutional neural networks (CNNs) in image classification, much recent effort has gone into applying CNNs to video-based action recognition. One challenge is that a video contains a varying number of frames, which is incompatible with the standard input format of CNNs. Existing methods either sample a fixed number of frames directly or bypass the issue by introducing a 3D convolutional layer, which performs convolution in the spatial-temporal domain. In this paper, we propose a novel network structure that accepts an arbitrary number of frames as input. The key to our solution is a module consisting of an encoding layer and a temporal pyramid pooling layer. The encoding layer maps the activations from the previous layers to a feature vector suitable for pooling, whereas the temporal pyramid pooling layer converts multiple frame-level activations into a fixed-length video-level representation. In addition, we adopt a feature concatenation layer that combines appearance and motion information. Compared with the frame sampling strategy, our method avoids the risk of missing important frames. Compared with the 3D convolutional method, which requires a huge video data set for network training, our model can be learned on a small target data set because it can leverage an off-the-shelf image-level CNN for parameter initialization. Experiments on three challenging data sets, Hollywood2, HMDB51, and UCF101, demonstrate the effectiveness of the proposed network.
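To illustrate how a temporal pyramid pooling step can turn a variable number of frame-level vectors into a fixed-length video-level vector, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `temporal_pyramid_pool`, the pyramid levels `(1, 2, 4)`, and the choice of element-wise max pooling are all assumptions made for illustration, since the abstract does not specify them.

```python
def temporal_pyramid_pool(frames, levels=(1, 2, 4)):
    """Pool T frame-level vectors (each of length D) into one vector
    of length D * sum(levels), regardless of T.

    Hypothetical helper: levels and max pooling are illustrative assumptions.
    """
    if not frames:
        raise ValueError("need at least one frame-level vector")
    num_frames = len(frames)
    pooled = []
    for k in levels:
        # Split the time axis into k segments (overlapping slightly when
        # num_frames is not divisible by k, so no segment is ever empty).
        for i in range(k):
            start = (i * num_frames) // k
            end = -(-((i + 1) * num_frames) // k)  # ceiling division
            segment = frames[start:end]
            # Element-wise max over the frames in this temporal segment.
            pooled.extend(max(column) for column in zip(*segment))
    return pooled


# Two "videos" of different lengths, each frame encoded as a 2-D vector.
short_video = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
long_video = [[float(t % 5), float(t % 3)] for t in range(40)]

# Both produce a vector of the same length: 2 * (1 + 2 + 4).
print(len(temporal_pyramid_pool(short_video)))
print(len(temporal_pyramid_pool(long_video)))
```

Because the output length depends only on the feature dimension and the pyramid levels, a layer of this kind can sit between per-frame feature extraction and a fixed-size classifier, which is the role the abstract assigns to the pooling module.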


Pages: 2613-2622

Page count: 10
