Weakly supervised temporal action localization via a multimodal feature map diffusion process

Cited by: 0
Authors
Zou, Yuanbing [1 ]
Zhao, Qingjie [1 ]
Li, Shanshan [2 ]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing 100081, Peoples R China
[2] Beijing Jinghang Res Inst Comp & Commun, Beijing 100074, Peoples R China
Keywords
Temporal action localization; Weakly-supervised learning; Diffusion models; Multimodal feature fusion
DOI
10.1016/j.engappai.2025.111044
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
With the continuous growth of massive video data, understanding video content has become increasingly important, and weakly supervised temporal action localization (WTAL) has received significant attention as a critical task. The goal of WTAL is to learn temporal class activation maps (TCAMs) using only video-level annotations and to localize actions in time via post-processing steps. However, because video-level annotations lack detailed behavioral information, the learned TCAMs separate foreground from background poorly, leading to incomplete action predictions. To address this, we leverage the inherent strength of the Contrastive Language-Image Pre-training (CLIP) model in producing semantically rich visual features; integrating CLIP-based visual information further enhances the representational capability of action features. We propose a novel multimodal feature map generation method based on diffusion models to fully exploit the complementary relationships between modalities. Specifically, we design a hard masking strategy whose binary masks serve as frame-level pseudo-ground-truth inputs to the diffusion model; these masks convey human behavior knowledge and enhance the model's generative capacity. The concatenated multimodal feature maps are then employed as conditional inputs to guide the generation of diffusion feature maps, enabling the model to extract rich action cues from diverse modalities. Experimental results demonstrate that our approach achieves state-of-the-art performance on two popular benchmarks, highlighting its ability to perform precise and efficient temporal action localization under weak supervision and contributing to the advancement of large-scale video data analysis.
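The pipeline the abstract describes lends itself to a compact illustration. Below is a minimal PyTorch sketch of one DDPM-style training step under the stated design: a TCAM is thresholded into a frame-level hard mask (pseudo-ground truth), the mask is noised by the forward diffusion process, and a denoiser conditioned on concatenated multimodal features predicts the noise. All names, dimensions (e.g. 1024-d I3D plus 512-d CLIP features), the threshold value, and the toy convolutional denoiser are assumptions made for illustration, not the authors' actual architecture; the timestep embedding a real denoiser would need is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hard_mask(tcam: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """Binarize TCAM scores (B, T) into frame-level pseudo-ground-truth
    masks; the 0.5 threshold is an assumed placeholder."""
    return (tcam > thresh).float()

class ConditionalDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to the mask sequence,
    conditioned on concatenated multimodal features along the channel axis.
    A real model would also embed the diffusion timestep t."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1 + feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, noisy_mask: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # noisy_mask: (B, 1, T); cond: (B, feat_dim, T)
        return self.net(torch.cat([noisy_mask, cond], dim=1))

# --- one simplified DDPM training step (Ho et al., 2020) ---
B, T, steps = 2, 100, 1000
feat_dim = 1024 + 512                       # assumed: I3D (1024) + CLIP (512) dims
betas = torch.linspace(1e-4, 0.02, steps)   # standard linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

tcam = torch.rand(B, T)                     # placeholder TCAM scores
x0 = hard_mask(tcam).unsqueeze(1)           # (B, 1, T) pseudo-GT masks
cond = torch.randn(B, feat_dim, T)          # placeholder concatenated features

t = torch.randint(0, steps, (B,))
a_bar = alphas_bar[t].view(B, 1, 1)
noise = torch.randn_like(x0)
x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward process q(x_t | x_0)

model = ConditionalDenoiser(feat_dim)
loss = F.mse_loss(model(x_t, cond), noise)  # epsilon-prediction objective
loss.backward()
print(f"diffusion loss: {loss.item():.4f}")
```

In the full method, samples drawn from the trained conditional diffusion model would be decoded back into refined TCAMs for localization; this sketch shows only the hard-mask pseudo-ground-truth conditioning mechanism.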
Pages: 14