STAR: Efficient SpatioTemporal Modeling for Action Recognition

Cited by: 2
Authors
Kumar, Abhijeet [1 ]
Abrams, Samuel [1 ]
Kumar, Abhishek [1 ]
Narayanan, Vijaykrishnan [1 ]
Affiliations
[1] Penn State Univ, EECS Dept, State Coll, PA 16802 USA
Keywords
Action recognition; Compressed domain; I-frames; Spatial-temporal 2D convolutional networks
DOI
10.1007/s00034-022-02160-x
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Codes
0808; 0809
Abstract
Action recognition in video has gained significant attention over the past several years. While conventional 2D CNNs have achieved great success in image understanding, they are less effective at capturing the temporal relationships present in video. 3D CNNs, by contrast, capture spatiotemporal information well, but their high computational cost makes deployment challenging. In video, key information is typically confined to a small number of frames, yet many current approaches decompress and process every frame, wasting resources. Others operate directly in the compressed domain but require multiple input streams to interpret the data. In our work, we operate directly on compressed video and extract information solely from intra-coded frames (I-frames), avoiding the use of motion vectors and residuals for motion information and making this a single-stream network. This reduces processing time and energy consumption and, by extension, makes the approach accessible to a wider range of machines and use cases. We evaluate our framework extensively on the UCF-101 (Soomro et al. in UCF101: a dataset of 101 human actions classes from videos in the Wild, 2012) and HMDB-51 (Kuehne et al., in: Jhuang, Garrote, Poggio, Serre (eds) Proceedings of the International Conference on Computer Vision (ICCV), 2011) datasets and show that computational complexity is reduced significantly while achieving accuracy competitive with existing compressed-domain efforts: 92.6% top-1 accuracy on UCF-101 and 62.9% on HMDB-51 with 24.3M parameters, 4 GFLOPs, and energy savings of over 11x on both datasets versus CoViAR (Wu et al. in Compressed video action recognition, 2018).
Pages: 705-723
Number of pages: 19
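To make the single-stream, I-frame-only idea described in the abstract concrete, the following is a minimal sketch, not the authors' STAR implementation: it assumes PyAV for decoding only the key (intra-coded) frames of a compressed clip, an ImageNet-pretrained ResNet-50 from torchvision (>= 0.13) as the per-frame 2D backbone, simple temporal average pooling, and a hypothetical file name example_clip.mp4. The actual spatial-temporal 2D convolutional design, parameter count, and training procedure of STAR are not reproduced here.

# Minimal sketch (not the authors' code): decode only I-frames from a
# compressed video with PyAV and classify them with a single-stream 2D CNN.
# The ResNet-50 backbone and temporal average pooling are illustrative
# assumptions, not the STAR architecture itself.
import av
import torch
import torch.nn as nn
from torchvision import models, transforms

def decode_iframes(video_path, max_frames=8):
    """Decode only the intra-coded (key) frames; P/B frames are skipped."""
    container = av.open(video_path)
    stream = container.streams.video[0]
    stream.codec_context.skip_frame = "NONKEY"   # decoder drops non-keyframes
    frames = []
    for frame in container.decode(stream):
        frames.append(frame.to_image())          # PIL image in RGB
        if len(frames) == max_frames:
            break
    container.close()
    return frames

class SingleStreamIFrameNet(nn.Module):
    """2D CNN applied per I-frame, followed by temporal average pooling."""
    def __init__(self, num_classes=101):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V2")  # needs torchvision >= 0.13
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x):                                    # x: (batch, time, 3, H, W)
        b, t = x.shape[:2]
        feats = self.features(x.flatten(0, 1)).flatten(1)    # (b*t, 2048)
        feats = feats.view(b, t, -1).mean(dim=1)             # temporal average pooling
        return self.classifier(feats)

if __name__ == "__main__":
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    frames = decode_iframes("example_clip.mp4")              # hypothetical input clip
    clip = torch.stack([preprocess(f) for f in frames]).unsqueeze(0)
    model = SingleStreamIFrameNet(num_classes=101).eval()    # 101 classes as in UCF-101
    with torch.no_grad():
        logits = model(clip)
    print(logits.argmax(dim=1))

Because the decoder never reconstructs P- or B-frames, no motion vectors or residual streams are needed at inference time, which is the source of the compute and energy savings the abstract reports relative to multi-stream compressed-domain methods such as CoViAR.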