Exploiting Attention-Consistency Loss For Spatial-Temporal Stream Action Recognition

Cited by: 12
Authors
Xu, Haotian [1 ]
Jin, Xiaobo [1 ]
Wang, Qiufeng [1 ]
Hussain, Amir [2 ]
Huang, Kaizhu [3 ]
Affiliations
[1] Xian Jiaotong Liverpool Univ, 111 Ren'ai Rd, Suzhou 215000, Jiangsu, Peoples R China
[2] Edinburgh Napier Univ, Edinburgh EH11 4BN, Midlothian, Scotland
[3] Duke Kunshan Univ, 8 Duke Ave, Kunshan 215316, Jiangsu, Peoples R China
Keywords
Action recognition; attention consistency; multi-level attention; two-stream structure
DOI
10.1145/3538749
CLC Number
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
Most current action recognition methods consider mainly the information from the spatial stream. Inspired by the human visual system, we propose a new perspective that combines the spatial and temporal streams and measures their attention consistency. Specifically, we develop a branch-independent convolutional neural network (CNN) based algorithm with a novel attention-consistency loss, which enables the temporal stream to concentrate on the same discriminative regions as the spatial stream over the same period. The consistency loss is further combined with the cross-entropy loss to enhance visual attention consistency. We evaluate the proposed method on two benchmark action recognition datasets: Kinetics400 and UCF101. Despite its apparent simplicity, our framework with attention consistency outperforms most two-stream networks, achieving 75.7% top-1 accuracy on Kinetics400 and 95.7% on UCF101, while reducing computational cost by 7.1% compared with our baseline. In particular, the proposed method attains remarkable improvements on complex action classes, showing that it can serve as a potential benchmark for handling complicated scenarios in Industry 4.0 applications.
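The abstract describes the objective only at a high level. Below is a minimal illustrative sketch in PyTorch of how such a combined objective could be formed; the attention-map extraction, the L2 distance between normalized maps, and the weighting factor lam are assumptions not specified in this record, not the authors' actual implementation.

import torch
import torch.nn.functional as F

def attention_consistency_loss(spatial_attn: torch.Tensor,
                               temporal_attn: torch.Tensor) -> torch.Tensor:
    # Penalize disagreement between the two streams' attention maps.
    # Both inputs are assumed to be (batch, H, W) attention maps drawn
    # from the same time period; an L2 distance between spatially
    # unit-normalized maps is one plausible choice of metric (assumption).
    s = F.normalize(spatial_attn.flatten(1), dim=1)
    t = F.normalize(temporal_attn.flatten(1), dim=1)
    return ((s - t) ** 2).sum(dim=1).mean()

def total_loss(logits_s: torch.Tensor, logits_t: torch.Tensor,
               labels: torch.Tensor, attn_s: torch.Tensor,
               attn_t: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # Cross-entropy on both streams plus the consistency term, mirroring
    # the abstract's combination of the two losses; lam is a hypothetical
    # trade-off weight.
    ce = F.cross_entropy(logits_s, labels) + F.cross_entropy(logits_t, labels)
    return ce + lam * attention_consistency_loss(attn_s, attn_t)

Under this sketch, gradients from the consistency term push the temporal stream's attention toward the spatial stream's discriminative regions, which is the behavior the abstract attributes to the method.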
Pages: 15