Unified Spatio-Temporal Attention Networks for Action Recognition in Videos

Cited by: 100
Authors
Li, Dong [1 ,2 ]
Yao, Ting [3 ]
Duan, Ling-Yu [4 ]
Mei, Tao [5 ,6 ]
Rui, Yong [7 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230000, Anhui, Peoples R China
[2] Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230000, Anhui, Peoples R China
[3] Microsoft Res, Multimedia Search & Mining Grp, Beijing 100080, Peoples R China
[4] Peking Univ, Natl Engn Lab Video Technol, Sch Elect Engn & Comp Sci, Beijing 100080, Peoples R China
[5] JD AI Res, Beijing 100101, Peoples R China
[6] JD AI Res, Comp Vis & Multimedia Lab, Beijing 100101, Peoples R China
[7] Lenovo, Beijing 100085, Peoples R China
Keywords
Action recognition; spatio-temporal attention; deep convolutional networks; model
DOI
10.1109/TMM.2018.2862341
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Recognizing actions in videos is not a trivial task because video is an information-intensive medium that includes multiple modalities. Moreover, within each modality, an action may appear only in certain spatial regions, or only some of the temporal video segments may contain the action. A natural question is how to locate the attended spatial areas and the selective video segments for action recognition. In this paper, we devise a general attention neural cell, called AttCell, that estimates an attention probability not only at each spatial location but also for each video segment in a temporal sequence. With AttCell, a unified Spatio-Temporal Attention Network (STAN) is proposed in the context of multiple modalities. Specifically, STAN extracts the feature map of one convolutional layer as local descriptors on each modality and pools the extracted descriptors, weighted by the spatial attention measured by AttCell, into a representation of each segment. Then, we concatenate the representations across modalities to seek a consensus on the temporal attention, which is used to holistically fuse the segment representations into a video representation for recognition. Our model differs from conventional deep networks with attention mechanisms in that our temporal attention provides principled and global guidance across different modalities and video segments. Extensive experiments are conducted on four public datasets: UCF101, CCV, THUMOS14, and Sports-1M; our STAN consistently achieves superior results over several state-of-the-art techniques. More remarkably, we validate and demonstrate the effectiveness of our proposal when capitalizing on different numbers of modalities.
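The two-stage attention described in the abstract can be sketched in NumPy: a softmax over spatial locations pools one convolutional feature map into a per-segment representation, and a softmax over segments then fuses those representations into a video-level vector. This is a minimal illustration only; the linear scoring vectors `w_s` and `w_t` are hypothetical stand-ins for the paper's actual AttCell parameterization, and the multi-modality concatenation step is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention_pool(feature_maps, w_s, w_t):
    """Sketch of spatial-then-temporal attention pooling.

    feature_maps: (T, H, W, C) conv feature maps for T video segments.
    w_s: (C,) spatial scoring vector (assumed linear scorer, not the
         paper's AttCell).
    w_t: (C,) temporal scoring vector (same assumption).
    Returns the video representation and both attention distributions.
    """
    T, H, W, C = feature_maps.shape
    locs = feature_maps.reshape(T, H * W, C)          # local descriptors

    # Spatial attention: one score per location, normalized per segment.
    s_att = softmax(locs @ w_s, axis=1)               # (T, H*W)
    seg_repr = (s_att[..., None] * locs).sum(axis=1)  # (T, C) segment reps

    # Temporal attention: one score per segment, normalized over segments.
    t_att = softmax(seg_repr @ w_t, axis=0)           # (T,)
    video_repr = (t_att[:, None] * seg_repr).sum(axis=0)  # (C,)
    return video_repr, s_att, t_att
```

Under this sketch, both attention maps are proper probability distributions (each spatial map sums to 1 per segment, and the temporal weights sum to 1 across segments), matching the abstract's description of attention probabilities over locations and segments.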
Pages: 416-428
Number of Pages: 13
Related Papers
50 records in total
  • [1] Spatio-Temporal Attention Networks for Action Recognition and Detection
    Li, Jun
    Liu, Xianglong
    Zhang, Wenxuan
    Zhang, Mingyuan
    Song, Jingkuan
    Sebe, Nicu
    IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (11) : 2990 - 3001
  • [2] Spatio-Temporal Fusion Networks for Action Recognition
    Cho, Sangwoo
    Foroosh, Hassan
    COMPUTER VISION - ACCV 2018, PT I, 2019, 11361 : 347 - 364
  • [3] Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos
    Duta, Ionut C.
    Ionescu, Bogdan
    Aizawa, Kiyoharu
    Sebe, Nicu
    MULTIMEDIA MODELING (MMM 2017), PT I, 2017, 10132 : 365 - 378
  • [4] Interpretable Spatio-temporal Attention for Video Action Recognition
    Meng, Lili
    Zhao, Bo
    Chang, Bo
    Huang, Gao
    Sun, Wei
    Tung, Frederich
    Sigal, Leonid
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 1513 - 1522
  • [5] Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks
    Wang, Y.
    Shen, X. J.
    Chen, H. P.
    Sun, J. X.
    PATTERN RECOGNITION AND IMAGE ANALYSIS, 2021, 31 (03) : 580 - 587
  • [6] Spatio-Temporal Human-Object Interactions for Action Recognition in Videos
    Escorcia, Victor
    Carlos Niebles, Juan
    2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2013, : 508 - 514
  • [7] STCA: an action recognition network with spatio-temporal convolution and attention
    Tian, Qiuhong
    Miao, Weilun
    Zhang, Lizao
    Yang, Ziyu
    Yu, Yang
    Zhao, Yanying
    Yao, Lan
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2025, 14 (01)
  • [8] A Spatio-Temporal Deep Learning Approach For Human Action Recognition in Infrared Videos
    Shah, Anuj K.
    Ghosh, Ripul
    Akula, Aparna
    OPTICS AND PHOTONICS FOR INFORMATION PROCESSING XII, 2018, 10751
  • [9] Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos
    Duta, Ionut Cosmin
    Ionescu, Bogdan
    Aizawa, Kiyoharu
    Sebe, Nicu
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 3205 - 3214