MODAL CONSENSUS AND CONTEXTUAL SEPARATION FOR WEAKLY SUPERVISED TEMPORAL ACTION LOCALIZATION

Cited by: 0
Authors
Liu, Peng [1 ]
Wang, Chuanxu [1 ]
Zhao, Min [1 ]
Affiliations
[1] Qingdao Univ Sci & Technol, Qingdao 266061, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024
Keywords
Weakly supervised learning; Temporal action localization; Cross-modal collaboration; Spatiotemporal self-attention; Hybrid modeling mechanism;
DOI
10.1109/ICASSP48485.2024.10446233
Abstract
Weakly-supervised Temporal Action Localization (W-TAL) is a challenging task that aims to identify action classes and localize their temporal boundaries using only video-level labels. Recent methods rely on simple cascading or integration of appearance and optical-flow features, which often yields incomplete action localization and ambiguity in distinguishing foreground from background. This paper therefore introduces the Modal Consensus and Context Separation (MCCS) approach to address these issues. First, a modal collaboration module enhances action feature representation by synergizing appearance and optical-flow features while discarding redundant elements to avoid suboptimal outcomes. The augmented bimodal streams are then fused by a spatiotemporal self-attention module that captures the spatial and temporal relationships of action snippets. In addition, a hybrid modeling mechanism performs foreground-background separation, focusing on local action attributes within the hybrid features to sharpen the distinction between foreground and background. Rigorous experiments on the THUMOS14 and ActivityNet1.3 datasets demonstrate the effectiveness of MCCS and its superiority in tackling the challenges of W-TAL.
Pages: 4220-4224
Page count: 5
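The first two modules described in the abstract — cross-modal filtering of appearance and optical-flow streams, followed by self-attention fusion over snippets — can be illustrated with a minimal numpy sketch. This is a hypothetical toy implementation based only on the abstract, not the authors' released code: the gating scheme (each modality re-weighting the other's channels via a sigmoid of its temporal average) and the single-head scaled dot-product attention are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_filter(rgb, flow):
    """Hypothetical modal-collaboration step: each modality gates the
    other's channels, suppressing channels judged redundant.
    rgb, flow: (T, C) snippet features for T snippets, C channels."""
    g_rgb = rgb.mean(axis=0)                    # (C,) global context
    g_flow = flow.mean(axis=0)
    gate_rgb = 1.0 / (1.0 + np.exp(-g_flow))    # flow gates rgb channels
    gate_flow = 1.0 / (1.0 + np.exp(-g_rgb))    # rgb gates flow channels
    return rgb * gate_rgb, flow * gate_flow

def temporal_self_attention(x):
    """Plain scaled dot-product self-attention over the snippet axis,
    standing in for the paper's spatiotemporal self-attention module."""
    T, C = x.shape
    attn = softmax(x @ x.T / np.sqrt(C), axis=-1)  # (T, T) snippet affinities
    return attn @ x                                # context-fused features

rng = np.random.default_rng(0)
T, C = 8, 16                                    # 8 snippets, 16-dim features
rgb = rng.standard_normal((T, C))
flow = rng.standard_normal((T, C))

rgb_f, flow_f = cross_modal_filter(rgb, flow)
fused = temporal_self_attention(np.concatenate([rgb_f, flow_f], axis=1))
print(fused.shape)
```

Downstream, a foreground-background separation head would score each of the `T` fused snippet features; the sketch stops at the fused representation since the abstract gives no further detail on the hybrid modeling mechanism.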