MT-CMVAD: A Multi-Modal Transformer Framework for Cross-Modal Video Anomaly Detection

Cited by: 0
Authors
Ding, Hantao [1 ]
Lou, Shengfeng [1 ]
Ye, Hairong [1 ]
Chen, Yanbing [1 ]
Affiliations
[1] Zhejiang Sci-Tech University, School of Computer Science & Technology and School of Artificial Intelligence, Hangzhou 310018, People's Republic of China
Source
APPLIED SCIENCES-BASEL | 2025, Vol. 15, No. 12 | 20 pages
Keywords
multi-modal transformer; LoRA; video anomaly detection; self-attention mechanism; cross-modal learning
DOI
10.3390/app15126773
CLC Classification Number
O6 [Chemistry]
Subject Classification Code
0703
Abstract
Video anomaly detection (VAD) faces significant challenges in multi-modal semantic alignment and long-term temporal modeling in open surveillance scenarios. Existing methods are often hampered by modality discrepancies and fragmented temporal reasoning. To address these issues, we introduce MT-CMVAD, a hierarchically structured Transformer architecture with two key technical contributions: (1) a Context-Aware Dynamic Fusion Module that uses cross-modal attention with learnable gating coefficients to bridge the RGB and optical-flow modalities through adaptive feature recalibration, substantially improving fusion quality; and (2) a Multi-Scale Spatiotemporal Transformer that establishes global temporal dependencies via dilated attention while preserving local spatial semantics through pyramidal feature aggregation. To cope with sparse anomaly supervision, we propose a hybrid learning objective that combines a dual-stream reconstruction loss with prototype-based contrastive discrimination, jointly optimizing pattern restoration and discriminative representation learning. Extensive experiments on the UCF-Crime, UBI-Fights, and UBnormal datasets demonstrate state-of-the-art performance, with AUC scores of 98.9%, 94.7%, and 82.9%, respectively. The explicit spatiotemporal encoding scheme further improves temporal alignment accuracy by 2.4%, strengthening anomaly localization and overall detection accuracy. The framework also reduces FLOPs by 14.3% and converges 18.7% faster during training, and its optimized window-shift attention mechanism further lowers computational complexity, making MT-CMVAD a robust and efficient solution for safety-critical video understanding tasks.
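To make the fusion idea concrete, the following is a minimal sketch (not the authors' released code) of how cross-modal attention with learnable gating coefficients could recalibrate RGB features using optical-flow context; the module name, dimensions, and sigmoid-gate design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Hypothetical sketch of gated cross-modal attention fusion.

    `d_model` and `n_heads` are assumed hyperparameters; the paper's
    actual Context-Aware Dynamic Fusion Module may differ in structure.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # RGB tokens (queries) attend to optical-flow tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable gate decides, per token and channel, how much
        # flow-conditioned context to blend back into the RGB stream.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb, flow: (batch, tokens, d_model)
        attended, _ = self.cross_attn(query=rgb, key=flow, value=flow)
        g = self.gate(torch.cat([rgb, attended], dim=-1))  # values in [0, 1]
        return self.norm(rgb + g * attended)  # gated residual recalibration


# Toy usage with random features: 2 clips, 16 tokens per modality.
rgb = torch.randn(2, 16, 256)
flow = torch.randn(2, 16, 256)
fused = GatedCrossModalFusion()(rgb, flow)
print(fused.shape)  # torch.Size([2, 16, 256])
```

The gated residual lets the network suppress flow context where the modalities disagree, which is one plausible reading of the adaptive feature recalibration described in the abstract.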
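Likewise, a hedged sketch of how a dual-stream reconstruction loss might be combined with prototype-based contrastive discrimination; the weighting `alpha`, the temperature, and the InfoNCE-style formulation over class prototypes are assumptions for illustration, not values or code from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(rgb_recon, rgb_target, flow_recon, flow_target,
                embeddings, prototypes, labels,
                temperature: float = 0.1, alpha: float = 0.5):
    """Illustrative hybrid objective; `alpha` and `temperature` are assumed."""
    # Dual-stream reconstruction: restore appearance (RGB) and motion (flow)
    # patterns; anomalous clips are expected to reconstruct poorly.
    recon = F.mse_loss(rgb_recon, rgb_target) + F.mse_loss(flow_recon, flow_target)
    # Prototype-based contrast: pull each clip embedding toward its class
    # prototype and away from the others (cross-entropy over cosine
    # similarities, i.e., an InfoNCE-style term).
    sim = F.normalize(embeddings, dim=-1) @ F.normalize(prototypes, dim=-1).T
    contrast = F.cross_entropy(sim / temperature, labels)
    return recon + alpha * contrast


# Toy usage: 8 clips, 64-d reconstructions, 128-d embeddings,
# 2 prototypes (normal vs. anomalous).
loss = hybrid_loss(torch.randn(8, 64), torch.randn(8, 64),
                   torch.randn(8, 64), torch.randn(8, 64),
                   torch.randn(8, 128), torch.randn(2, 128),
                   torch.randint(0, 2, (8,)))
print(loss.item())
```

Under this reading, the reconstruction term drives pattern restoration while the prototype term supplies the discriminative signal, matching the joint optimization the abstract describes.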