MT-CMVAD: A Multi-Modal Transformer Framework for Cross-Modal Video Anomaly Detection

Cited by: 0
Authors
Ding, Hantao [1 ]
Lou, Shengfeng [1 ]
Ye, Hairong [1 ]
Chen, Yanbing [1 ]
Affiliations
[1] Zhejiang Sci-Tech University, School of Computer Science & Technology and School of Artificial Intelligence, Hangzhou 310018, People's Republic of China
Source
APPLIED SCIENCES-BASEL | 2025, Vol. 15, No. 12 | 20 pages
Keywords
multi-modal transformer; LoRA; video anomaly detection; self-attention mechanism; cross-modal learning
DOI
10.3390/app15126773
CLC Classification Number
O6 [Chemistry]
Subject Classification Code
0703
Abstract
Video anomaly detection (VAD) faces significant challenges in multi-modal semantic alignment and long-term temporal modeling in open surveillance scenarios. Existing methods are often hampered by modality discrepancies and fragmented temporal reasoning. To address these issues, we introduce MT-CMVAD, a hierarchically structured Transformer architecture with two key technical contributions: (1) a Context-Aware Dynamic Fusion Module that uses cross-modal attention with learnable gating coefficients to bridge the RGB and optical-flow modalities through adaptive feature recalibration, substantially improving fusion quality; and (2) a Multi-Scale Spatiotemporal Transformer that establishes global temporal dependencies via dilated attention while preserving local spatial semantics through pyramidal feature aggregation. To cope with sparse anomaly supervision, we propose a hybrid learning objective that combines a dual-stream reconstruction loss with prototype-based contrastive discrimination, jointly optimizing pattern restoration and discriminative representation learning. Extensive experiments on the UCF-Crime, UBI-Fights, and UBnormal datasets demonstrate state-of-the-art performance, with AUC scores of 98.9%, 94.7%, and 82.9%, respectively. The explicit spatiotemporal encoding scheme further improves temporal alignment accuracy by 2.4%, strengthening anomaly localization and overall detection accuracy. The framework also reduces FLOPs by 14.3% and converges 18.7% faster during training, and its optimized window-shift attention mechanism further lowers computational complexity, making MT-CMVAD a robust and efficient solution for safety-critical video understanding tasks.
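To make the fusion idea concrete, the following is a minimal sketch (not the authors' released code) of how cross-modal attention with learnable gating coefficients could recalibrate RGB features using optical-flow context; the module name, dimensions, and sigmoid-gate design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Hypothetical sketch of gated cross-modal attention fusion.

    `d_model` and `n_heads` are assumed hyperparameters; the paper's
    actual Context-Aware Dynamic Fusion Module may differ in structure.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # RGB tokens (queries) attend to optical-flow tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable gate decides, per token and channel, how much
        # flow-conditioned context to blend back into the RGB stream.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb, flow: (batch, tokens, d_model)
        attended, _ = self.cross_attn(query=rgb, key=flow, value=flow)
        g = self.gate(torch.cat([rgb, attended], dim=-1))  # values in [0, 1]
        return self.norm(rgb + g * attended)  # gated residual recalibration


# Toy usage with random features: 2 clips, 16 tokens per modality.
rgb = torch.randn(2, 16, 256)
flow = torch.randn(2, 16, 256)
fused = GatedCrossModalFusion()(rgb, flow)
print(fused.shape)  # torch.Size([2, 16, 256])
```

The gated residual lets the network suppress flow context where the modalities disagree, which is one plausible reading of the adaptive feature recalibration described in the abstract.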
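Likewise, a hedged sketch of how a dual-stream reconstruction loss might be combined with prototype-based contrastive discrimination; the weighting `alpha`, the temperature, and the InfoNCE-style formulation over class prototypes are assumptions for illustration, not values or code from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(rgb_recon, rgb_target, flow_recon, flow_target,
                embeddings, prototypes, labels,
                temperature: float = 0.1, alpha: float = 0.5):
    """Illustrative hybrid objective; `alpha` and `temperature` are assumed."""
    # Dual-stream reconstruction: restore appearance (RGB) and motion (flow)
    # patterns; anomalous clips are expected to reconstruct poorly.
    recon = F.mse_loss(rgb_recon, rgb_target) + F.mse_loss(flow_recon, flow_target)
    # Prototype-based contrast: pull each clip embedding toward its class
    # prototype and away from the others (cross-entropy over cosine
    # similarities, i.e., an InfoNCE-style term).
    sim = F.normalize(embeddings, dim=-1) @ F.normalize(prototypes, dim=-1).T
    contrast = F.cross_entropy(sim / temperature, labels)
    return recon + alpha * contrast


# Toy usage: 8 clips, 64-d reconstructions, 128-d embeddings,
# 2 prototypes (normal vs. anomalous).
loss = hybrid_loss(torch.randn(8, 64), torch.randn(8, 64),
                   torch.randn(8, 64), torch.randn(8, 64),
                   torch.randn(8, 128), torch.randn(2, 128),
                   torch.randint(0, 2, (8,)))
print(loss.item())
```

Under this reading, the reconstruction term drives pattern restoration while the prototype term supplies the discriminative signal, matching the joint optimization the abstract describes.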