Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos

Cited by: 0
Authors
Murtaza, Shakeeb [1]
Pedersoli, Marco [1]
Sarraf, Aydin [2]
Granger, Eric [1]
Affiliations
[1] ETS Montreal, Dept Syst Engn, LIVIA, Montreal, PQ, Canada
[2] Ericsson, Global AI Accelerator, Montreal, PQ, Canada
Keywords
DOI
10.1007/978-3-031-71602-7_17
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Weakly-Supervised Video Object Localization (WSVOL) involves localizing an object in videos using only video-level labels, also referred to as tags. State-of-the-art WSVOL methods like Temporal CAM (TCAM) rely on class activation mapping (CAM) and typically require a pre-trained CNN classifier. However, their localization accuracy is affected by their tendency to minimize the mutual information between different instances of a class and to exploit temporal information during training for downstream tasks, e.g., detection and tracking. In the absence of bounding box annotations, it is challenging to extract precise information about objects from temporal cues because the model struggles to locate objects over time. To address these issues, a novel method called transformer-based CAM for videos (TrCAM-V) is proposed for WSVOL. It consists of a DeiT backbone with two heads, one for classification and one for localization. The classification head is trained using a standard classification loss (CL), while the localization head is trained using pseudo-labels that are extracted using a pre-trained CLIP model. In these pseudo-labels, high and low activation values are considered to be foreground and background regions, respectively. Our TrCAM-V method allows training a localization network by sampling pseudo-pixels on the fly from these regions. Additionally, a conditional random field (CRF) loss is employed to align the object boundaries with the foreground map. During inference, the model can process individual frames for real-time localization applications. Extensive experiments on the challenging YouTube-Objects unconstrained video datasets show that our TrCAM-V method achieves new state-of-the-art performance in terms of classification and localization accuracy. Code: https://github.com/shakeebmurtaza/TrCAM/.
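The pseudo-pixel sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_pseudo_pixels` and `partial_cross_entropy`, the `region_frac` threshold, the uniform sampling, and the toy 4x4 activation map are all assumptions made for illustration. The sketch treats the highest-activation pixels as the foreground region and the lowest-activation pixels as the background region, draws a few pixels from each on every call, and evaluates a cross-entropy loss only on the sampled pixels.

```python
import math
import random

IGNORE = -1  # pixels that carry no pseudo-label and are skipped by the loss


def sample_pseudo_pixels(activation, n_fg=2, n_bg=2, region_frac=0.2, rng=None):
    """Return a label map with sampled foreground (1) and background (0) pixels.

    Hypothetical helper: region_frac and uniform sampling are assumptions.
    """
    rng = rng or random.Random()
    height, width = len(activation), len(activation[0])
    # Rank all pixel coordinates from lowest to highest activation.
    coords = sorted(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: activation[rc[0]][rc[1]],
    )
    if len(coords) < 2:
        raise ValueError("need at least two pixels to form two regions")
    # Cap the region size at half the map so the two regions never overlap.
    region_size = max(1, min(len(coords) // 2, int(region_frac * len(coords))))
    bg_region = coords[:region_size]   # lowest activations
    fg_region = coords[-region_size:]  # highest activations

    labels = [[IGNORE] * width for _ in range(height)]
    for r, c in rng.sample(fg_region, min(n_fg, len(fg_region))):
        labels[r][c] = 1
    for r, c in rng.sample(bg_region, min(n_bg, len(bg_region))):
        labels[r][c] = 0
    return labels


def partial_cross_entropy(fg_prob, labels, eps=1e-7):
    """Mean binary cross-entropy computed over labeled pixels only."""
    total, count = 0.0, 0
    for prob_row, label_row in zip(fg_prob, labels):
        for p, y in zip(prob_row, label_row):
            if y == IGNORE:
                continue
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total += -math.log(p) if y == 1 else -math.log(1.0 - p)
            count += 1
    return total / count if count else 0.0


# Toy activation map with a bright 2x2 object in the centre (made-up values).
cam = [
    [0.05, 0.10, 0.08, 0.02],
    [0.07, 0.85, 0.90, 0.04],
    [0.03, 0.80, 0.95, 0.06],
    [0.01, 0.09, 0.11, 0.05],
]
labels = sample_pseudo_pixels(cam, n_fg=2, n_bg=2, region_frac=0.25,
                              rng=random.Random(0))
loss = partial_cross_entropy(cam, labels)
```

Because a fresh subset of pixels is drawn on each call, repeated calls over training iterations expose the localization head to different parts of the same regions. In this toy usage the activation map doubles as the predicted foreground probability, which is only a convenience for the example.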
Pages: 195-207
Page count: 13
Related papers
50 in total
  • [1] Adversarial Transformers for Weakly Supervised Object Localization
    Meng, Meng
    Zhang, Tianzhu
    Zhang, Zhe
    Zhang, Yongdong
    Wu, Feng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 7130 - 7143
  • [3] Weakly supervised object localization and segmentation in videos
    Rochan, Mrigank
    Rahman, Shafin
    Bruce, Neil D. B.
    Wang, Yang
    IMAGE AND VISION COMPUTING, 2016, 56 : 1 - 12
  • [4] Tracking-assisted Weakly Supervised Online Visual Object Segmentation in Unconstrained Videos
    Zhang, Zongpu
    Hua, Yang
    Song, Tao
    Xue, Zhengui
    Ma, Ruhui
    Robertson, Neil
    Guan, Haibing
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 941 - 949
  • [5] Discriminative Sampling of Proposals in Self-Supervised Transformers for Weakly Supervised Object Localization
    Murtaza, Shakeeb
    Belharbi, Soufiane
    Pedersoli, Marco
    Sarraf, Aydin
    Granger, Eric
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW), 2023, : 155 - 165
  • [6] Leveraging orientation for weakly supervised object detection with application to firearm localization
    Iqbal, Javed
    Munir, Muhammad Akhtar
    Mahmood, Arif
    Ali, Afsheen Rafaqat
    Ali, Mohsen
    NEUROCOMPUTING, 2021, 440 : 310 - 320
  • [7] TCAM: Temporal Class Activation Maps for Object Localization in Weakly-Labeled Unconstrained Videos
    Belharbi, Soufiane
    Ben Ayed, Ismail
    McCaffrey, Luke
    Granger, Eric
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 137 - 146
  • [8] Rethinking the Localization in Weakly Supervised Object Localization
    Xu, Rui
    Luo, Yong
    Hu, Han
    Du, Bo
    Shen, Jialie
    Wen, Yonggang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5484 - 5494
  • [9] Generalized Weakly Supervised Object Localization
    Zhang, Dingwen
    Guo, Guangyu
    Zeng, Wenyuan
    Li, Lei
    Han, Junwei
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (04) : 5395 - 5406
  • [10] DiPS: Discriminative pseudo-label sampling with self-supervised transformers for weakly supervised object localization
    Murtaza, Shakeeb
    Belharbi, Soufiane
    Pedersoli, Marco
    Sarraf, Aydin
    Granger, Eric
    IMAGE AND VISION COMPUTING, 2023, 140