Diffusion-based framework for weakly-supervised temporal action localization

Cited by: 0
Authors
Zou, Yuanbing [1 ]
Zhao, Qingjie [1 ]
Sarker, Prodip Kumar [1 ]
Li, Shanshan [1 ]
Wang, Lei [2 ]
Liu, Wangwang [2 ]
Affiliations
[1] School of Computer Science and Technology, Beijing Institute of Technology, Beijing
[2] Beijing Institute of Control Engineering, Beijing
Keywords
Diffusion; Mask learning; Temporal action localization; Weakly-supervised learning
DOI
10.1016/j.patcog.2024.111207
Abstract
Weakly supervised temporal action localization aims to localize action instances with only video-level supervision. In the absence of frame-level annotations, effectively separating action snippets from backgrounds within semantically ambiguous features becomes an arduous challenge for this task. To address this issue from a generative modeling perspective, we propose a novel two-stage diffusion-based network. First, we design a local masking mechanism module that learns local semantic information and generates binary masks at the early stage, which (1) are used to perform action-background separation and (2) serve as the pseudo-ground truth required by the diffusion module. Second, we propose a diffusion module that generates high-quality action predictions under pseudo-ground-truth supervision. In addition, we further optimize the refining operation in the local masking module to improve its efficiency. Experimental results demonstrate that the proposed method achieves promising performance on the mainstream public datasets THUMOS14 and ActivityNet. The code is available at https://github.com/Rlab123/action_diff. © 2024
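The two-stage pipeline described above can be sketched as follows. This is a minimal illustration under assumed shapes and a toy noise schedule; the function names, threshold, and schedule are placeholders, not the authors' implementation (see the linked repository for the actual method).

```python
import numpy as np

def local_mask(attn, threshold=0.5):
    """Stage 1 (sketch): binarize per-snippet attention scores into a
    foreground/background mask that also serves as pseudo-ground truth."""
    return (attn >= threshold).astype(np.float32)

def diffuse(mask, t, noise):
    """Stage 2, forward process (sketch): corrupt the pseudo-mask with
    Gaussian noise under a toy linear schedule; the network is then
    trained to denoise it back toward the pseudo-ground truth."""
    alpha = 1.0 - t  # toy schedule, assumed for illustration
    return np.sqrt(alpha) * mask + np.sqrt(1.0 - alpha) * noise

rng = np.random.default_rng(0)
attn = rng.random(8)                                  # T=8 snippet scores
mask = local_mask(attn)                               # binary pseudo-GT
noisy = diffuse(mask, t=0.3, noise=rng.standard_normal(8))
```

In the paper, the denoising network consumes such noisy masks and video features to produce refined action predictions; the sketch only shows the mask generation and corruption steps.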