Infrared Action Detection in the Dark via Cross-Stream Attention Mechanism

Cited by: 24

Authors:

Chen, Xu [1,2]

Gao, Chenqiang [1,2]

Li, Chaoyu [3]

Yang, Yi [4]

Meng, Deyu [5,6]

Affiliations:

[1] Chongqing Univ Posts & Telecommun, Sch Commun & Informat Engn, Chongqing 400065, Peoples R China

[2] Chongqing Key Lab Signal & Informat Proc, Chongqing 400065, Peoples R China

[3] Chongqing Univ Posts & Telecommun, Sch Automat, Chongqing 400065, Peoples R China

[4] Univ Technol Sydney, Ctr Artificial Intelligence, Ultimo, NSW 2007, Australia

[5] Macau Univ Sci & Technol, Macau Inst Syst Engn, Taipa 999078, Macao, Peoples R China

[6] Xi An Jiao Tong Univ, Sch Math & Stat, Xian 710049, Shaanxi, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2022 / Vol. 24

Funding:

National Natural Science Foundation of China;

Keywords:

Optical imaging; Streaming media; Feature extraction; Proposals; Task analysis; Image recognition; Three-dimensional displays; Infrared video; selective cross-stream attention; temporal action detection; ACTION RECOGNITION; NETWORKS;

DOI:

10.1109/TMM.2021.3050069

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Action detection plays an important role in video understanding and has attracted considerable attention over the last decade. However, current action detection methods are mainly based on visible videos, and few of them consider low-light scenes, where actions are difficult to detect with existing methods, or even with the human eye. Compared with visible videos, infrared videos are better suited to dark environments and more resistant to background clutter. In this paper, we investigate the temporal action detection problem in the dark by using infrared videos, which is, to the best of our knowledge, the first attempt in the action detection community. Our model takes the whole video as input. A Flow Estimation Network (FEN) is employed to generate the optical flow for infrared data, and it is optimized jointly with the whole network to obtain action-related motion representations. After feature extraction, the infrared stream and the flow stream are fed into a Selective Cross-stream Attention (SCA) module to narrow the performance gap between infrared and visible videos. The SCA module emphasizes informative snippets and automatically focuses on the more discriminative stream. We then adopt a snippet-level classifier to obtain action scores for all snippets and link consecutive snippets into final detections. All these modules are trained in an end-to-end manner. We collect an Infrared action Detection (InfDet) dataset captured in the dark and conduct extensive experiments to verify the effectiveness of the proposed method. Experimental results show that the proposed method surpasses state-of-the-art temporal action detection methods designed for visible videos, and that it also achieves the best performance among infrared action recognition methods on both the InfAR and Infrared-Visible datasets.
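The last step the abstract describes, linking consecutive snippets into detections, can be illustrated with a minimal sketch. The abstract does not specify the linking rule, so the score threshold, the fixed snippet duration, the mean-score aggregation, and the function name `link_snippets` are all assumptions made for illustration, not the authors' procedure.

```python
from typing import List, Tuple


def link_snippets(scores: List[float],
                  threshold: float = 0.5,      # assumed cutoff, not from the paper
                  snippet_sec: float = 1.0     # assumed snippet duration in seconds
                  ) -> List[Tuple[float, float, float]]:
    """Group runs of consecutive above-threshold snippets into segments.

    Returns a list of (start_sec, end_sec, mean_score) tuples.
    """
    detections = []
    run_start = None  # index of the first snippet in the current run
    # A sentinel score below any threshold closes a run that reaches the end.
    for i, score in enumerate(scores + [float("-inf")]):
        if score >= threshold and run_start is None:
            run_start = i
        elif score < threshold and run_start is not None:
            run = scores[run_start:i]
            detections.append((run_start * snippet_sec,
                               i * snippet_sec,
                               sum(run) / len(run)))
            run_start = None
    return detections


if __name__ == "__main__":
    # Per-snippet action scores for a single class (made-up values).
    print(link_snippets([0.1, 0.8, 0.9, 0.2, 0.7]))
```

With these made-up scores, snippets 1 and 2 form one segment and snippet 4 forms another; a real system would typically run this per action class and may smooth the scores first.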


Pages: 288-300

Page count: 13
