Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Cited by: 22
Authors
Wang, Mengmeng [1 ]
Xing, Jiazheng [1 ]
Su, Jing [2 ]
Chen, Jun [1 ]
Liu, Yong [1 ]
Affiliations
[1] Zhejiang Univ, Coll Control Sci & Engn, Lab Adv Percept Robot & Intelligent Learning, Hangzhou 310027, Zhejiang, Peoples R China
[2] Fudan Univ, Dept Opt Sci & Engn, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; frequency illustration; motion features; spatiotemporal features; twins training framework; representation;
DOI
10.1109/TPAMI.2022.3173658
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Recent methods for action recognition typically apply 3D Convolutional Neural Networks (CNNs) to extract spatiotemporal features and introduce optical flow to represent motion features. Although achieving state-of-the-art performance, they are expensive in both time and space. In this paper, we propose to represent both kinds of features in a unified 2D CNN without any 3D convolution or optical flow computation. In particular, we first design a channel-wise spatiotemporal module to represent spatiotemporal features and a channel-wise motion module to encode feature-level motion features efficiently. In addition, we provide a distinctive illustration of the two modules from the frequency domain by interpreting them as advanced, learnable versions of frequency components. Second, we combine these two modules and an identity mapping path into one unified block that can easily replace the original residual block in the ResNet architecture, forming a simple yet effective network dubbed the STM network, which introduces only very limited extra computational cost and parameters. Third, we propose a novel Twins Training framework for action recognition, incorporating a correlation loss to optimize the inter-class and intra-class correlation and a siamese structure to make full use of the training data. We extensively validate the proposed STM on both temporal-related datasets (i.e., Something-Something v1 & v2) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51), and it achieves favorable results against state-of-the-art methods on all of these datasets.
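To make the block structure the abstract describes concrete, below is a minimal PyTorch sketch of a residual-style block that combines a channel-wise spatiotemporal module (CSTM) and a channel-wise motion module (CMM) with an identity mapping path. It assumes a TSN-style input layout of (batch x frames, channels, height, width); the kernel sizes, the 1x1 reduce/expand convolutions, and the helper names are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn


class CSTM(nn.Module):
    # Channel-wise spatiotemporal module (sketch): a depthwise (channel-wise)
    # 1D convolution mixes information along the time axis, then an ordinary
    # 2D convolution mixes it spatially.
    def __init__(self, channels: int):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3,
                                  padding=1, groups=channels, bias=False)
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        nt, c, h, w = x.shape                     # x: (N*T, C, H, W)
        n = nt // t
        # Fold the spatial grid into the batch so Conv1d runs along T.
        y = x.view(n, t, c, h * w).permute(0, 3, 2, 1).reshape(n * h * w, c, t)
        y = self.temporal(y)
        y = y.reshape(n, h * w, c, t).permute(0, 3, 2, 1).reshape(nt, c, h, w)
        return self.bn(self.spatial(y))


class CMM(nn.Module):
    # Channel-wise motion module (sketch): feature-level motion is taken as
    # the difference between frame t's features and a channel-wise transform
    # of frame t+1's features, so no optical flow is computed.
    def __init__(self, channels: int):
        super().__init__()
        self.shift = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // t
        v = x.view(n, t, c, h, w)
        nxt = self.shift(v[:, 1:].reshape(-1, c, h, w)).view(n, t - 1, c, h, w)
        diff = nxt - v[:, :-1]                    # motion between adjacent frames
        diff = torch.cat([diff, diff.new_zeros(n, 1, c, h, w)], dim=1)  # pad last frame
        return self.bn(diff.reshape(nt, c, h, w))


class STMBlock(nn.Module):
    # Residual-style block combining CSTM, CMM, and an identity mapping path;
    # intended as a drop-in replacement for a plain ResNet residual block.
    def __init__(self, channels: int, t: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.t = t
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.cstm = CSTM(mid)
        self.cmm = CMM(mid)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)
        y = self.cstm(y, self.t) + self.cmm(y, self.t)
        return self.relu(x + self.expand(y))      # identity mapping path


if __name__ == "__main__":
    clip = torch.randn(2 * 8, 64, 56, 56)         # 2 clips of 8 frames each
    block = STMBlock(channels=64, t=8)
    print(block(clip).shape)                      # torch.Size([16, 64, 56, 56])

Because the block is built entirely from 2D and depthwise 1D convolutions, it adds only a small number of parameters over the residual block it replaces, consistent with the abstract's claim of very limited extra computation and parameters.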
Pages: 3347-3362
Number of pages: 16