Exploiting Attention-Consistency Loss For Spatial-Temporal Stream Action Recognition

Cited by: 12
Authors
Xu, Haotian [1 ]
Jin, Xiaobo [1 ]
Wang, Qiufeng [1 ]
Hussain, Amir [2 ]
Huang, Kaizhu [3 ]
Affiliations
[1] Xian Jiaotong Liverpool Univ, 111 Ren'ai Rd, Suzhou 215000, Jiangsu, Peoples R China
[2] Edinburgh Napier Univ, Edinburgh EH11 4BN, Midlothian, Scotland
[3] Duke Kunshan Univ, 8 Duke Ave, Kunshan 215316, Jiangsu, Peoples R China
Keywords
Action recognition; attention consistency; multi-level attention; two-stream structure
DOI
10.1145/3538749
CLC Number
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
Most current action recognition methods consider mainly the information from the spatial stream. Inspired by the human visual system, we propose a new perspective that combines the spatial and temporal streams and measures their attention consistency. Specifically, we develop a branch-independent convolutional neural network (CNN) based algorithm with a novel attention-consistency loss, which enables the temporal stream to concentrate on the same discriminative regions as the spatial stream over the same period. The consistency loss is further combined with the cross-entropy loss to enhance visual attention consistency. We evaluate the proposed method on two benchmark action recognition datasets: Kinetics400 and UCF101. Despite its apparent simplicity, our framework with attention consistency outperforms most two-stream networks, achieving 75.7% top-1 accuracy on Kinetics400 and 95.7% on UCF101, while reducing computational cost by 7.1% compared with our baseline. In particular, the proposed method attains remarkable improvements on complex action classes, showing that it can serve as a potential benchmark for handling complicated scenarios in Industry 4.0 applications.
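The abstract describes the objective only at a high level. Below is a minimal illustrative sketch in PyTorch of how such a combined objective could be formed; the attention-map extraction, the L2 distance between normalized maps, and the weighting factor lam are assumptions not specified in this record, not the authors' actual implementation.

import torch
import torch.nn.functional as F

def attention_consistency_loss(spatial_attn: torch.Tensor,
                               temporal_attn: torch.Tensor) -> torch.Tensor:
    # Penalize disagreement between the two streams' attention maps.
    # Both inputs are assumed to be (batch, H, W) attention maps drawn
    # from the same time period; an L2 distance between spatially
    # unit-normalized maps is one plausible choice of metric (assumption).
    s = F.normalize(spatial_attn.flatten(1), dim=1)
    t = F.normalize(temporal_attn.flatten(1), dim=1)
    return ((s - t) ** 2).sum(dim=1).mean()

def total_loss(logits_s: torch.Tensor, logits_t: torch.Tensor,
               labels: torch.Tensor, attn_s: torch.Tensor,
               attn_t: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # Cross-entropy on both streams plus the consistency term, mirroring
    # the abstract's combination of the two losses; lam is a hypothetical
    # trade-off weight.
    ce = F.cross_entropy(logits_s, labels) + F.cross_entropy(logits_t, labels)
    return ce + lam * attention_consistency_loss(attn_s, attn_t)

Under this sketch, gradients from the consistency term push the temporal stream's attention toward the spatial stream's discriminative regions, which is the behavior the abstract attributes to the method.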
Pages: 15