Motion-Driven Visual Tempo Learning for Video-Based Action Recognition

Cited by: 53
Authors
Liu, Yuanzhong [1]
Yuan, Junsong [2]
Tu, Zhigang [1]
Affiliations
[1] Wuhan Univ, State Key Lab Informat Engn Surveying Mapping & R, Wuhan 430079, Peoples R China
[2] SUNY Buffalo, Comp Sci & Engn Dept, Buffalo, NY 14260 USA
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Visualization; Correlation; Dynamics; Three-dimensional displays; Semantics; Spatiotemporal phenomena; Action recognition; visual tempo; multi-scale temporal structure; temporal correlation module; REPRESENTATION; NETWORKS;
DOI
10.1109/TIP.2022.3180585
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Action visual tempo characterizes the dynamics and temporal scale of an action, which helps distinguish human actions that are highly similar in visual dynamics and appearance. Previous methods capture visual tempo either by sampling raw videos at multiple rates, which requires a costly multi-layer network to handle each rate, or by hierarchically sampling backbone features, which relies heavily on high-level features that miss fine-grained temporal dynamics. In this work, we propose a Temporal Correlation Module (TCM) that can be easily embedded into current action recognition backbones in a plug-and-play manner and extracts action visual tempo from low-level backbone features at a single layer. Specifically, TCM contains two main components: a Multi-scale Temporal Dynamics Module (MTDM) and a Temporal Attention Module (TAM). MTDM applies a correlation operation to learn pixel-wise fine-grained temporal dynamics for both fast and slow tempos. TAM adaptively emphasizes expressive features and suppresses inessential ones by analyzing global information across the various tempos. Extensive experiments on several action recognition benchmarks, e.g., Something-Something V1 & V2, Kinetics-400, UCF-101, and HMDB-51, demonstrate that the proposed TCM improves the performance of existing video-based action recognition models by a large margin. The source code is publicly released at https://github.com/zphyix/TCM.
Pages: 4104-4116
Page count: 13