Diffusion-based framework for weakly-supervised temporal action localization

Cited by: 0
Authors
Zou, Yuanbing [1 ]
Zhao, Qingjie [1 ]
Sarker, Prodip Kumar [1 ]
Li, Shanshan [1 ]
Wang, Lei [2 ]
Liu, Wangwang [2 ]
Affiliations
[1] School of Computer Science and Technology, Beijing Institute of Technology, Beijing
[2] Beijing Institute of Control Engineering, Beijing
Keywords
Diffusion; Mask learning; Temporal action localization; Weakly-supervised learning
DOI
10.1016/j.patcog.2024.111207
Abstract
Weakly supervised temporal action localization aims to localize action instances with only video-level supervision. In the absence of frame-level annotations, effectively separating action snippets from backgrounds within semantically ambiguous features becomes an arduous challenge for this task. To address this issue from a generative modeling perspective, we propose a novel two-stage diffusion-based network. First, we design a local masking mechanism module that learns local semantic information and generates binary masks at the early stage, which (1) are used to perform action-background separation and (2) serve as the pseudo-ground truth required by the diffusion module. Second, we propose a diffusion module that generates high-quality action predictions under pseudo-ground-truth supervision. In addition, we further optimize the refining operation in the local masking module to improve its efficiency. Experimental results demonstrate that the proposed method achieves promising performance on the mainstream public datasets THUMOS14 and ActivityNet. The code is available at https://github.com/Rlab123/action_diff. © 2024
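The two-stage pipeline described above can be sketched as follows. This is a minimal illustration under assumed shapes and a toy noise schedule; the function names, threshold, and schedule are placeholders, not the authors' implementation (see the linked repository for the actual method).

```python
import numpy as np

def local_mask(attn, threshold=0.5):
    """Stage 1 (sketch): binarize per-snippet attention scores into a
    foreground/background mask that also serves as pseudo-ground truth."""
    return (attn >= threshold).astype(np.float32)

def diffuse(mask, t, noise):
    """Stage 2, forward process (sketch): corrupt the pseudo-mask with
    Gaussian noise under a toy linear schedule; the network is then
    trained to denoise it back toward the pseudo-ground truth."""
    alpha = 1.0 - t  # toy schedule, assumed for illustration
    return np.sqrt(alpha) * mask + np.sqrt(1.0 - alpha) * noise

rng = np.random.default_rng(0)
attn = rng.random(8)                                  # T=8 snippet scores
mask = local_mask(attn)                               # binary pseudo-GT
noisy = diffuse(mask, t=0.3, noise=rng.standard_normal(8))
```

In the paper, the denoising network consumes such noisy masks and video features to produce refined action predictions; the sketch only shows the mask generation and corruption steps.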