Differential motion attention network for efficient action recognition

Cited by: 0
Authors
Liu, Caifeng [1]
Gu, Fangjie [1]
Affiliations
[1] Dalian Univ Technol, Sch Econ & Management, Dalian 116000, Peoples R China
Keywords
Action recognition; Temporal reasoning; Differential motion attention; Efficiency; Neural networks
DOI
10.1007/s00371-024-03478-0
CLC number (Chinese Library Classification)
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
Despite the great progress achieved by the commonly used 3D CNNs and two-stream methods in action recognition, they incur a heavy computational burden that makes them inefficient, or even infeasible, in real-world scenarios. In this paper, we propose the differential motion attention network (DMANet) to specifically highlight human dynamics for efficient action recognition. First, we argue that consecutive frames contain redundant static features, and we construct a low-computation unit for discriminative motion extraction that highlights human action trajectories across consecutive frames. Second, since not all spatial regions in an image play an equal role in depicting human actions, we propose an adaptive protocol to dynamically emphasize informative spatial regions. As an end-to-end lightweight framework, our DMANet outperforms costly 3D CNNs and two-stream methods by 2.3% with only 0.23× their computations, and other efficient methods by 1.6%, on the Something-Something v1 dataset. Experimental results on two temporal-related datasets and the large-scale, scene-related Kinetics-400 dataset demonstrate the efficacy of DMANet. In-depth ablation studies further provide both quantitative and qualitative support for its effects.
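The abstract describes two ingredients: frame differencing to cancel redundant static appearance, and a lightweight spatial attention map that re-weights informative regions. The paper's implementation is not given here; the following is a minimal PyTorch sketch of how such a differential-motion attention block could look, assuming the common pattern of temporal differencing followed by per-frame spatial attention. The module name, the reduction bottleneck, and the residual re-weighting are all hypothetical illustration, not the authors' code.

```python
# Hypothetical sketch (not the authors' implementation): temporal frame
# differencing suppresses static content, and a cheap 1x1-conv bottleneck
# turns the motion residual into a spatial attention map.
import torch
import torch.nn as nn


class DifferentialMotionAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Lightweight bottleneck producing a single-channel attention map.
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        # Differential motion: subtract each frame from its successor so
        # static appearance cancels; zero-pad the last step to keep the
        # temporal length unchanged.
        diff = x[:, 1:] - x[:, :-1]
        diff = torch.cat([diff, diff.new_zeros(b, 1, c, h, w)], dim=1)
        # Per-frame spatial attention computed from the motion residual.
        attn = self.attn(diff.reshape(b * t, c, h, w)).reshape(b, t, 1, h, w)
        # Residual re-weighting: emphasize moving regions without
        # discarding static cues entirely.
        return x * (1.0 + attn)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 64, 56, 56)  # 2 clips, 8 frames each
    out = DifferentialMotionAttention(64)(clip)
    print(out.shape)  # torch.Size([2, 8, 64, 56, 56])
```

The cost of this block is a pair of 1×1 convolutions per frame, which is consistent with the abstract's emphasis on a low-computation motion unit, though the paper's actual design may differ.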
Pages: 1719-1731
Number of pages: 13