Weakly supervised spatial-temporal attention network driven by tracking and consistency loss for action detection

Cited by: 2
Authors
Zhu, Jinlei [1 ]
Chen, Houjin [1 ]
Pan, Pan [1 ]
Sun, Jia [1 ]
Affiliation
[1] Beijing Jiaotong Univ, Sch Elect & Informat Engn, Beijing 100044, Peoples R China
Keywords
Weakly supervised learning; Consistency loss; Spatial attention; Channel attention
DOI
10.1186/s13640-022-00588-4
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic and communication technology]
Discipline code
0808; 0809
Abstract
This study proposes a novel network model for video action tube detection, based on a location-interactive, weakly supervised spatial-temporal attention mechanism driven by multiple loss functions. Annotating every target location in video frames is especially costly and time-consuming. We therefore first propose a cross-domain weakly supervised learning method with a spatial-temporal attention mechanism for action tube detection. In the source domain, we train a newly designed multi-loss spatial-temporal attention-convolution network on a source dataset that has both object-location and classification annotations. In the target domain, we introduce an internal tracking loss and a neighbor-consistency loss, and train the network from the pre-trained model on a target dataset that has only inaccurate temporal action positions. Although the method is location-unsupervised, it outperforms typical weakly supervised methods and even shows results comparable to some recent fully supervised methods. We also visualize the activation maps, which reveal the intrinsic reason for the higher performance of the proposed method.
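The neighbor-consistency idea in the abstract, encouraging attention maps of adjacent frames to agree, can be sketched as below. This is a minimal illustration under assumed conventions, not the authors' implementation: the function name `neighbor_consistency_loss`, the `(T, H, W)` attention-map layout, and the mean-squared-difference form of the penalty are all assumptions for the sake of the example.

```python
import numpy as np

def neighbor_consistency_loss(attn_maps: np.ndarray) -> float:
    """Penalize disagreement between attention maps of adjacent frames.

    attn_maps: array of shape (T, H, W), one spatial attention map per frame.
    Returns the mean squared difference over all adjacent frame pairs,
    so identical maps across time give a loss of exactly 0.
    """
    # Frame-to-frame differences: shape (T-1, H, W)
    diffs = attn_maps[1:] - attn_maps[:-1]
    return float(np.mean(diffs ** 2))

# Identical maps across frames -> zero loss
static = np.ones((4, 8, 8))
print(neighbor_consistency_loss(static))  # 0.0
```

In a training loop this term would be weighted and summed with the classification and tracking losses; here it only shows the structural idea that temporal smoothness of attention can supervise localization without box labels.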
Pages: 18