Compressed Video Action Recognition With Dual-Stream and Dual-Modal Transformer

Cited by: 4

Authors:

Mou, Yuting [1]

Jiang, Xinghao [1]

Xu, Ke [1]

Sun, Tanfeng [1]

Wang, Zepeng [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Natl Engn Lab Informat Content Anal Tech, Shanghai 200240, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2024 / Vol. 34 / No. 05

Funding:

National Natural Science Foundation of China;

Keywords:

Compressed video; action recognition; NETWORK; EFFICIENCY;

DOI:

10.1109/TCSVT.2023.3319140

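Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code: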

0808 ; 0809 ;

Abstract:

Compressed video action recognition reduces decoding and inference time compared with recognition in the RGB domain. However, the compressed domain poses unique challenges because it contains different frame types (I-frames and P-frames). I-frames, which are consistent with RGB frames, are rich in frame information, but their redundant information may interfere with the recognition task. P-frames contain two modalities, residual (R) and motion vector (MV). Although they carry less information, they reflect motion cues. To address these challenges and leverage the independent information from different frames and modalities, we propose a novel approach called Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frames stream contains temporal information. We propose the Dual-Modal Attention Module (DAM) to mine the variability across modalities in P-frames and complement the orthogonal feature vector. In addition, considering the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frames stream extracts the global context feature of the entire video, including content and scene information. By fusing the global video context and local key-frame features, our model represents the action feature at both fine-grained and coarse-grained levels. We evaluate DSDMT on three public benchmarks of different scales: HMDB-51, UCF-101, and Kinetics-400. Our method achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.
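
The abstract gives no implementation details, so the following is only a rough illustration of the general idea: two P-frame modalities exchange information through cross-attention, and the result is fused with an I-frame context feature. It is not the authors' DAM or FPE. All function names, token shapes, the identity Q/K/V projections, and the mean-pool-and-concatenate fusion are assumptions made for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens):
    """Scaled dot-product attention where query_tokens attend to context_tokens.

    Assumption: learned Q/K/V projections are replaced by the identity
    to keep the sketch short.
    """
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context_tokens, weights

def dual_modal_fuse(mv_tokens, res_tokens):
    """Hypothetical stand-in for a dual-modal attention step on P-frames."""
    mv_enriched, _ = cross_attention(mv_tokens, res_tokens)   # MV attends to residual
    res_enriched, _ = cross_attention(res_tokens, mv_tokens)  # residual attends to MV
    # Skip connection per modality, then concatenate along the feature axis.
    return np.concatenate([mv_tokens + mv_enriched,
                           res_tokens + res_enriched], axis=-1)

def two_stream_feature(p_frame_tokens, i_frame_tokens):
    """Assumed fusion: mean-pool each stream and concatenate the two vectors."""
    local_motion = p_frame_tokens.mean(axis=0)
    global_context = i_frame_tokens.mean(axis=0)
    return np.concatenate([local_motion, global_context])

rng = np.random.default_rng(0)
mv = rng.standard_normal((8, 16))        # 8 motion-vector tokens, 16-dim (made up)
res = rng.standard_normal((8, 16))       # 8 residual tokens, 16-dim (made up)
iframes = rng.standard_normal((4, 32))   # 4 I-frame tokens, 32-dim (made up)

fused = dual_modal_fuse(mv, res)
video_feature = two_stream_feature(fused, iframes)
```

In this toy setup `fused` has shape `(8, 32)` and `video_feature` has shape `(64,)`; a classifier head over `video_feature` would complete the pipeline, but its design is not stated in the abstract.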


Pages: 3299-3312

Number of pages: 14
