Action detection with two-stream enhanced detector

Cited by: 3
Authors
Zhang, Min [1]
Hu, Haiyang [1]
Li, Zhongjin [1]
Chen, Jie [1]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Action detection; Spatiotemporal localization; Object detection; Anchor cuboid; Attention
DOI
10.1007/s00371-021-02397-8
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Action understanding in videos is a challenging task that has attracted widespread attention in recent years. Most current methods localize the bounding boxes of actors at the frame level, and then track or link these detections to form action tubes across frames. These methods often focus on exploiting temporal context in videos while neglecting the importance of the detector itself. In this paper, we present a two-stream enhanced framework for action detection. Specifically, we devise appearance and motion detectors in a two-stream manner, which take k consecutive RGB frames and optical flow images as input, respectively. To improve the feature representation capability, an anchor refinement sub-module with feature alignment is introduced into the two-stream architecture to generate flexible anchor cuboids. Meanwhile, a hierarchical fusion strategy is used to concatenate intermediate feature maps for capturing fast-moving subjects. Moreover, layer normalization with skip connections is adopted to reduce the internal covariate shift between network layers, which makes the training process simple and effective. Compared with state-of-the-art methods, the proposed approach yields performance gains on three prevailing datasets (UCF-Sports, UCF-101 and J-HMDB), which confirms the effectiveness of our enhanced detector for action detection.
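The abstract mentions layer normalization combined with a skip connection but gives no formula or layer ordering. The following is a minimal sketch of that general idea only, assuming a post-normalization arrangement (normalize the sum of the input and the sub-layer output) and no learned scale or shift. The function names `layer_norm` and `residual_layer_norm` and the toy sub-layer are hypothetical and are not taken from the paper.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_layer_norm(x, sublayer):
    """Apply a sub-layer, add the skip connection, then layer-normalize.

    Assumption: post-norm ordering; the paper may use a different one.
    """
    y = sublayer(x)
    summed = [a + b for a, b in zip(x, y)]  # skip connection
    return layer_norm(summed)


if __name__ == "__main__":
    # Toy sub-layer standing in for a convolutional block (hypothetical).
    features = [1.0, 2.0, 3.0, 4.0]
    print(residual_layer_norm(features, lambda v: [2.0 * t for t in v]))
```

In this sketch the output always has zero mean across the feature dimension regardless of the sub-layer's scale, which is the property usually credited with stabilizing training.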
Pages: 1193-1204
Number of pages: 12