Weakly supervised spatial-temporal attention network driven by tracking and consistency loss for action detection

Cited by: 2
Authors
Zhu, Jinlei [1 ]
Chen, Houjin [1 ]
Pan, Pan [1 ]
Sun, Jia [1 ]
Affiliation
[1] Beijing Jiaotong Univ, Sch Elect & Informat Engn, Beijing 100044, Peoples R China
Keywords
Weakly supervised learning; Consistency loss; Spatial attention; Channel attention
DOI
10.1186/s13640-022-00588-4
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic and communication technology]
Discipline code
0808; 0809
Abstract
This study proposes a novel network model for video action tube detection, based on a location-interactive, weakly supervised spatial-temporal attention mechanism driven by multiple loss functions. Annotating every target location in video frames is especially costly and time-consuming. We therefore first propose a cross-domain weakly supervised learning method with a spatial-temporal attention mechanism for action tube detection. In the source domain, we train a newly designed multi-loss spatial-temporal attention-convolution network on a source dataset that has both object-location and classification annotations. In the target domain, we introduce an internal tracking loss and a neighbor-consistency loss, and train the network from the pre-trained model on a target dataset that has only inaccurate temporal action positions. Although the method is location-unsupervised, it outperforms typical weakly supervised methods and even shows results comparable to some recent fully supervised methods. We also visualize the activation maps, which reveal the intrinsic reason for the higher performance of the proposed method.
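The neighbor-consistency idea in the abstract, encouraging attention maps of adjacent frames to agree, can be sketched as below. This is a minimal illustration under assumed conventions, not the authors' implementation: the function name `neighbor_consistency_loss`, the `(T, H, W)` attention-map layout, and the mean-squared-difference form of the penalty are all assumptions for the sake of the example.

```python
import numpy as np

def neighbor_consistency_loss(attn_maps: np.ndarray) -> float:
    """Penalize disagreement between attention maps of adjacent frames.

    attn_maps: array of shape (T, H, W), one spatial attention map per frame.
    Returns the mean squared difference over all adjacent frame pairs,
    so identical maps across time give a loss of exactly 0.
    """
    # Frame-to-frame differences: shape (T-1, H, W)
    diffs = attn_maps[1:] - attn_maps[:-1]
    return float(np.mean(diffs ** 2))

# Identical maps across frames -> zero loss
static = np.ones((4, 8, 8))
print(neighbor_consistency_loss(static))  # 0.0
```

In a training loop this term would be weighted and summed with the classification and tracking losses; here it only shows the structural idea that temporal smoothness of attention can supervise localization without box labels.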
Pages: 18