Compressed Video Action Recognition With Dual-Stream and Dual-Modal Transformer

Cited by: 4
Authors
Mou, Yuting [1]
Jiang, Xinghao [1]
Xu, Ke [1]
Sun, Tanfeng [1]
Wang, Zepeng [1]
Affiliation
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Natl Engn Lab Informat Content Anal Tech, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Compressed video; action recognition; network; efficiency
DOI
10.1109/TCSVT.2023.3319140
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract
Compressed video action recognition reduces decoding and inference time compared to recognition in the RGB domain. However, the compressed domain poses unique challenges because it contains different types of frames (I-frames and P-frames). I-frames, which are consistent with RGB frames, are rich in frame information, but their redundant information may interfere with the recognition task. P-frames contain two modalities, residual (R) and motion vector (MV). Although they carry less information, they reflect motion cues. To address these challenges and leverage the independent information from different frames and modalities, we propose a novel approach called Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frames stream contains temporal information. We propose the Dual-Modal Attention Module (DAM) to mine the variability between the modalities in P-frames and complement the orthogonal feature vector. In addition, considering the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frames stream extracts the global context feature of the entire video, including content and scene information. By fusing the global video context and local key-frame features, our model represents the action feature at both fine-grained and coarse-grained levels. We evaluate DSDMT on three public benchmarks of different scales: HMDB-51, UCF-101, and Kinetics-400. Our method achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.
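The abstract does not specify how the Dual-Modal Attention Module is built, so the following is only a minimal sketch of the general idea of letting one P-frame modality attend to the other. It uses plain scaled dot-product cross-attention, which is a stand-in and not the authors' DAM. The function names, the toy token values, and the additive fusion step are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention (not the paper's DAM).

    Each query token attends over all key/value tokens.
    Returns the attended outputs and the attention weights.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(wj * v[d] for wj, v in zip(w, values))
             for d in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical toy tokens: 2 residual (R) tokens and 3 motion-vector (MV)
# tokens, 4 dimensions each. Real features would come from patch embeddings.
residual_tokens = [[0.1, 0.0, 0.2, 0.1], [0.0, 0.3, 0.1, 0.0]]
mv_tokens = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.5], [0.0, 0.0, 1.0, 0.5]]

# Residual tokens query the motion-vector tokens.
r_from_mv, r_weights = cross_attention(residual_tokens, mv_tokens, mv_tokens)

# Assumed fusion: add the attended MV context back onto each residual token.
fused_residual = [
    [a + b for a, b in zip(token, context)]
    for token, context in zip(residual_tokens, r_from_mv)
]
```

Swapping the roles of `residual_tokens` and `mv_tokens` gives the symmetric direction, in which motion-vector tokens attend to residual tokens. Whether DSDMT uses one direction, both, or a different mechanism entirely is not stated in the abstract.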
Pages: 3299-3312
Page count: 14