Learning Spatial and Temporal Extents of Human Actions for Action Detection

Times Cited: 38
Authors
Zhou, Zhong [1 ]
Shi, Feng [1 ]
Wu, Wei [1 ]
Affiliations
[1] Beihang Univ, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
Keywords
Action localization; action recognition; discriminative latent variable model; split-and-merge; FRAMEWORK; MODELS;
DOI
10.1109/TMM.2015.2404779
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812
Abstract
For the problem of action detection, most existing methods require that the relevant portions of the action of interest in training videos be manually annotated with bounding boxes. Some recent works have tried to avoid this tedious manual annotation by automatically identifying the relevant portions in training videos. However, these methods identify relevant portions in either the spatial or the temporal domain alone, and may therefore include irrelevant content from the other domain. Such irrelevant content is undesirable in the training phase and degrades detection performance. This paper advances prior work by proposing a joint learning framework that simultaneously identifies the spatial and temporal extents of the action of interest in training videos. To obtain pixel-level localization results, our method uses dense trajectories extracted from videos as local features to represent actions. We first present a trajectory split-and-merge algorithm that segments a video into the background and several separated foreground moving objects; this algorithm exploits the inherent temporal smoothness of human actions to facilitate segmentation. Then, applying the latent SVM framework to the segmentation results, the spatial and temporal extents of the action of interest are treated as latent variables that are inferred jointly with action recognition. Experiments on two challenging datasets show that action detection with our learned spatial and temporal extents is superior to state-of-the-art methods.
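The core idea in the abstract can be illustrated with a minimal sketch (not the authors' code): in a latent SVM, the spatio-temporal extent z of an action is a latent variable, and a clip is scored by maximizing w·Φ(x, z) over candidate extents, so inference of the extent happens jointly with recognition. The names `candidate_extents` and `features_for` are hypothetical placeholders for the paper's trajectory-segmentation hypotheses and feature pooling.

```python
# Hedged sketch of latent-variable scoring in a latent SVM:
# score(x) = max over z of w . phi(x, z), where z is the candidate
# spatial/temporal extent. All function and variable names here are
# illustrative assumptions, not the authors' implementation.
import numpy as np

def score_clip(w, trajectories, candidate_extents, features_for):
    """Return (best score, inferred latent extent) for one clip.

    w                 -- learned weight vector
    trajectories      -- dense-trajectory features of the whole clip
    candidate_extents -- iterable of candidate spatio-temporal extents z
    features_for      -- maps (trajectories, z) -> feature vector phi(x, z)
    """
    best_score, best_extent = float("-inf"), None
    for z in candidate_extents:
        phi = features_for(trajectories, z)  # pool trajectories inside extent z
        s = float(w @ phi)                   # linear score w . phi(x, z)
        if s > best_score:
            best_score, best_extent = s, z
    return best_score, best_extent
```

At training time the same maximization is run inside the learning loop, so the extents that best explain the action label are selected while the classifier is fit, which is what lets the method avoid manual bounding-box annotation.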
Pages: 512 - 525
Page count: 14
Related Papers (50 total)
  • [21] Robust Human Action Recognition Using Global Spatial-Temporal Attention for Human Skeleton Data
    Han, Yun
    Chung, Sheng-Luen
    Ambikapathi, ArulMurugan
    Chan, Jui-Shan
    Lin, Wei-You
    Su, Shun-Feng
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [22] Human Action Recognition by Fusion of Convolutional Neural Networks and spatial-temporal Information
    Li, Weisheng
    Ding, Yahui
    8TH INTERNATIONAL CONFERENCE ON INTERNET MULTIMEDIA COMPUTING AND SERVICE (ICIMCS2016), 2016, : 255 - 259
  • [23] Spatial-Temporal Action Localization With Hierarchical Self-Attention
    Pramono, Rizard Renanda Adhi
    Chen, Yie-Tarng
    Fang, Wen-Hsien
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 625 - 639
  • [24] Spatial-Temporal Information Aggregation and Cross-Modality Interactive Learning for RGB-D-Based Human Action Recognition
    Cheng, Qin
    Liu, Zhen
    Ren, Ziliang
    Cheng, Jun
    Liu, Jianming
    IEEE ACCESS, 2022, 10 : 104190 - 104201
  • [25] Spatial-Temporal Attention for Action Recognition
    Sun, Dengdi
    Wu, Hanqing
    Ding, Zhuanlian
    Luo, Bin
    Tang, Jin
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 854 - 864
  • [26] Spatio-Temporal Analysis for Human Action Detection and Recognition in Uncontrolled Environments
    Liu, Dianting
    Yan, Yilin
    Shyu, Mei-Ling
    Zhao, Guiru
    Chen, Min
    INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT, 2015, 6 (01) : 1 - 18
  • [27] Rotation-based spatial-temporal feature learning from skeleton sequences for action recognition
    Liu, Xing
    Li, Yanshan
    Xia, Rongjie
    SIGNAL, IMAGE AND VIDEO PROCESSING, 2020, 14 : 1227 - 1234
  • [28] Temporal-Spatial Mapping for Action Recognition
    Song, Xiaolin
    Lan, Cuiling
    Zeng, Wenjun
    Xing, Junliang
    Sun, Xiaoyan
    Yang, Jingyu
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (03) : 748 - 759
  • [29] Action Progression Networks for Temporal Action Detection in Videos
    Lu, Chong-Kai
    Mak, Man-Wai
    Li, Ruimin
    Chi, Zheru
    Fu, Hong
    IEEE ACCESS, 2024, 12 : 126829 - 126844
  • [30] Streamer action recognition in live video with spatial-temporal attention and deep dictionary learning
    Li, Chenhao
    Zhang, Jing
    Yao, Jiacheng
    NEUROCOMPUTING, 2021, 453 : 383 - 392