Weakly supervised temporal action localization via a multimodal feature map diffusion process

Cited by: 0
Authors
Zou, Yuanbing [1 ]
Zhao, Qingjie [1 ]
Li, Shanshan [2 ]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing 100081, Peoples R China
[2] Beijing Jinghang Res Inst Comp & Commun, Beijing 100074, Peoples R China
Keywords
Temporal action localization; Weakly-supervised learning; Diffusion models; Multimodal feature fusion
DOI
10.1016/j.engappai.2025.111044
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Discipline classification code
0812
Abstract
With the rapid growth of video data, understanding video content has become increasingly important, and weakly supervised temporal action localization (WTAL) has received significant attention as a critical task. The goal of WTAL is to learn temporal class activation maps (TCAMs) using only video-level annotations and to perform temporal action localization via post-processing. However, because video-level annotations lack detailed behavioral information, the learned TCAMs separate foreground from background poorly, leading to incomplete action predictions. To address this, we leverage the inherent strength of the Contrastive Language-Image Pre-training (CLIP) model in producing highly semantic visual features: by integrating CLIP-based visual information, we further enhance the representational capability of action features. We propose a novel multimodal feature map generation method based on diffusion models to fully exploit the complementary relationships between modalities. Specifically, we design a hard masking strategy to generate hard masks, which serve as frame-level pseudo-ground-truth inputs to the diffusion model; these masks convey human behavior knowledge and enhance the model's generative capacity. The concatenated multimodal feature maps are then employed as conditional inputs to guide the generation of diffusion feature maps, enabling the model to extract rich action cues from diverse modalities. Experimental results demonstrate that our approach achieves state-of-the-art performance on two popular benchmarks. These results highlight the proposed method's ability to perform precise and efficient temporal action detection under weak supervision, contributing to the advancement of large-scale video data analysis.
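The abstract outlines the core mechanism: frame-level hard masks act as pseudo-ground truth for a diffusion process whose denoising is conditioned on concatenated RGB and CLIP feature maps. The sketch below illustrates that idea in PyTorch; it is not the authors' implementation, and all module names, feature dimensions, and the noise schedule (MaskDenoiser, q_sample, rgb_dim=2048, clip_dim=512, 1000 timesteps) are assumptions made for illustration only.

# Minimal sketch (not the authors' code): conditional diffusion over a
# frame-level hard-mask pseudo-ground truth, conditioned on concatenated
# RGB and CLIP snippet features. All names and dimensions are hypothetical.
import torch
import torch.nn as nn

class MaskDenoiser(nn.Module):
    """Predicts the noise added to the hard-mask sequence, given multimodal conditioning."""
    def __init__(self, rgb_dim=2048, clip_dim=512, hidden=256, num_steps=1000):
        super().__init__()
        self.cond_proj = nn.Linear(rgb_dim + clip_dim, hidden)   # fuse the concatenated modalities
        self.mask_proj = nn.Linear(1, hidden)                    # embed the noisy mask value per snippet
        self.time_embed = nn.Embedding(num_steps, hidden)        # diffusion timestep embedding
        self.backbone = nn.Sequential(
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, 3, padding=1),                  # per-snippet noise estimate
        )

    def forward(self, noisy_mask, cond_feats, t):
        # noisy_mask: (B, T, 1), cond_feats: (B, T, rgb_dim + clip_dim), t: (B,)
        h = self.mask_proj(noisy_mask) + self.cond_proj(cond_feats) + self.time_embed(t)[:, None, :]
        return self.backbone(h.transpose(1, 2)).transpose(1, 2)  # (B, T, 1)

def q_sample(x0, t, betas):
    """Standard forward diffusion: add Gaussian noise to the clean hard mask x0 at step t."""
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    a = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(x0)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise, noise

# Toy training step under the assumptions stated above.
B, T, num_steps = 4, 100, 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
model = MaskDenoiser(num_steps=num_steps)
rgb, clip_feat = torch.randn(B, T, 2048), torch.randn(B, T, 512)
hard_mask = (torch.rand(B, T, 1) > 0.5).float()           # frame-level pseudo-ground truth
cond = torch.cat([rgb, clip_feat], dim=-1)                 # concatenated multimodal feature map
t = torch.randint(0, num_steps, (B,))
noisy, noise = q_sample(hard_mask, t, betas)
loss = nn.functional.mse_loss(model(noisy, cond, t), noise)
loss.backward()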
Pages: 14
Related papers
88 records in total
[11]   Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization [J].
Hong, Fa-Ting ;
Feng, Jia-Chang ;
Xu, Dan ;
Shan, Ying ;
Zheng, Wei-Shi .
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, :1591-1599
[12]   Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers [J].
Hu, Xin ;
Li, Kai ;
Patel, Deep ;
Kruus, Erik ;
Min, Martin Renqiang ;
Ding, Zhengming .
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW, 2024, :2704-2713
[13]   Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation [J].
Huang, Linjiang ;
Wang, Liang ;
Li, Hongsheng .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, :3262-3271
[14]  
Islam A, 2021, AAAI CONF ARTIF INTE, V35, P1637
[15]  
Jiang Y.-G., 2014, THUMOS challenge: Action recognition with a large number of classes
[16]   Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization [J].
Ju, Chen ;
Zheng, Kunhao ;
Liu, Jinxiang ;
Zhao, Peisen ;
Zhang, Ya ;
Chang, Jianlong ;
Tian, Qi ;
Wang, Yanfeng .
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, :14751-14762
[17]  
King DB, 2015, ACS SYM SER, V1214, P1, DOI 10.1021/bk-2015-1214.ch001
[18]  
Lee P, 2021, AAAI CONF ARTIF INTE, V35, P1854
[19]  
Lee P, 2020, AAAI CONF ARTIF INTE, V34, P11320
[20]   Boosting Weakly-Supervised Temporal Action Localization with Text Information [J].
Li, Guozhang ;
Cheng, De ;
Ding, Xinpeng ;
Wang, Nannan ;
Wang, Xiaoyu ;
Gao, Xinbo .
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, :10648-10657