Learning Visual Tempo for Action Recognition

Cited by: 0

Authors:

Nie, Mu [1]

Yang, Sen [2]

Yang, Wankou [2]

Affiliations:

[1] Southeast Univ, Sch Cyber Sci & Engn, Nanjing 210096, Peoples R China

[2] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China

Source:

ARTIFICIAL INTELLIGENCE AND ROBOTICS, ISAIR 2022, PT I | 2022 / Vol. 1700

Keywords:

Action recognition; Spatiotemporal; Multi-receptive field; Visual tempo; NETWORK;

DOI:

10.1007/978-981-19-7946-0_13

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Variation in visual tempo, an essential feature in action recognition, characterizes the spatiotemporal scale and dynamics of an action. Existing models usually rely on spatiotemporal convolution to understand spatiotemporal scenes. However, they cannot cope with differences in visual tempo because of their limited view of the temporal and spatial dimensions. To address this issue, we propose a multi-receptive field spatiotemporal (MRF-ST) network that effectively models spatial and temporal information. We use dilated convolutions to obtain different receptive fields and design an attention-based dynamic weighting over the different dilation rates. The MRF-ST network can directly obtain various tempos in the same network layer without any additional learning cost. Moreover, the network can improve action recognition accuracy by learning the visual tempos of different actions. Extensive evaluations show that MRF-ST reaches state-of-the-art performance on the UCF-101 and HMDB-51 datasets. Further analysis indicates that MRF-ST can significantly improve performance in scenes with large variance in visual tempo.
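The abstract's central idea, parallel dilated convolutions whose outputs are fused by attention-derived weights, can be illustrated with a toy one-dimensional sketch. This is not the authors' MRF-ST implementation: the record gives no architectural details, so the function names, the 1D temporal signal, the "same" zero padding, and the attention score (mean absolute branch response passed through a softmax) are all assumptions made only for illustration.

```python
import math


def receptive_field(kernel_size, dilation):
    """Temporal extent covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded ('same' length) 1D dilated cross-correlation over time."""
    k = len(kernel)
    span = dilation * (k - 1)          # receptive field minus one
    pad = span // 2
    padded = [0.0] * pad + list(signal) + [0.0] * (span - pad)
    return [
        sum(kernel[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(signal))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_receptive_field(signal, kernel, dilations):
    """Fuse branches with different dilation rates using softmax weights."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    # Hypothetical attention score (an assumption, not from the paper):
    # the mean absolute response of each branch.
    scores = [sum(abs(v) for v in b) / len(b) for b in branches]
    weights = softmax(scores)
    fused = [
        sum(w * b[t] for w, b in zip(weights, branches))
        for t in range(len(signal))
    ]
    return fused, weights


# A short periodic "motion" signal; each branch sees a different time span.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
fused, weights = multi_receptive_field(signal, [1.0, 0.0, -1.0], dilations=(1, 2, 3))
```

With a kernel of size 3, dilation rates 1, 2, and 3 cover 3, 5, and 7 time steps respectively, so a single layer observes several temporal scales at once, and the weights decide how much each scale contributes to the fused output.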

Pages: 139-155

Page count: 17
