Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos

Cited by: 0

Authors:

Murtaza, Shakeeb [1]

Pedersoli, Marco [1]

Sarraf, Aydin [2]

Granger, Eric [1]

Affiliations:

[1] ETS Montreal, Dept Syst Engn, LIVIA, Montreal, PQ, Canada

[2] Ericsson, Global AI Accelerator, Montreal, PQ, Canada

Source:

ARTIFICIAL NEURAL NETWORKS IN PATTERN RECOGNITION, ANNPR 2024 | 2024 / Vol. 15154

Keywords:

DOI:

10.1007/978-3-031-71602-7_17

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Weakly-Supervised Video Object Localization (WSVOL) involves localizing an object in videos using only video-level labels, also referred to as tags. State-of-the-art WSVOL methods like Temporal CAM (TCAM) rely on class activation mapping (CAM) and typically require a pre-trained CNN classifier. However, their localization accuracy is affected by their tendency to minimize the mutual information between different instances of a class and exploit temporal information during training for downstream tasks, e.g., detection and tracking. In the absence of bounding box annotations, it is challenging to exploit precise information about objects from temporal cues because the model struggles to locate objects over time. To address these issues, a novel method called transformer-based CAM for videos (TrCAM-V) is proposed for WSVOL. It consists of a DeiT backbone with two heads for classification and localization. The classification head is trained using a standard classification loss (CL), while the localization head is trained using pseudo-labels that are extracted using a pre-trained CLIP model. From these pseudo-labels, the high and low activation values are considered to be foreground and background regions, respectively. Our TrCAM-V method allows training a localization network by sampling pseudo-pixels on the fly from these regions. Additionally, a conditional random field (CRF) loss is employed to align the object boundaries with the foreground map. During inference, the model can process individual frames for real-time localization applications. Extensive experiments on the challenging YouTube-Objects unconstrained video datasets show that our TrCAM-V method achieves new state-of-the-art performance in terms of classification and localization accuracy. Code: https://github.com/shakeebmurtaza/TrCAM/.
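
The pseudo-pixel sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the `frac` parameter, the label encoding (1 = foreground, 0 = background, -1 = unlabeled), and the uniform sampling within each region are all assumptions made for illustration. It only shows the general idea of treating high-activation pixels as foreground candidates and low-activation pixels as background candidates, then drawing a few of each at random.

```python
import random


def sample_pseudo_pixels(cam, n_fg=1, n_bg=1, frac=0.1, seed=None):
    """Draw foreground/background pseudo-pixels from a 2D activation map.

    Hypothetical helper (names and defaults are assumptions).
    cam  : list of rows of floats, the activation map for one frame.
    frac : fraction of pixels kept in each candidate pool.
    Returns a label map: 1 = foreground, 0 = background, -1 = unlabeled.
    """
    rng = random.Random(seed)
    h, w = len(cam), len(cam[0])
    coords = [(r, c) for r in range(h) for c in range(w)]
    # Sort pixel coordinates by ascending activation value.
    coords.sort(key=lambda rc: cam[rc[0]][rc[1]])
    # Pool size: a fraction of the map, but large enough to sample from.
    k = max(n_fg, n_bg, int(frac * len(coords)))
    if 2 * k > len(coords):
        raise ValueError("activation map too small for disjoint pools")
    bg_pool = coords[:k]    # lowest activations: background candidates
    fg_pool = coords[-k:]   # highest activations: foreground candidates
    labels = [[-1] * w for _ in range(h)]
    for r, c in rng.sample(bg_pool, n_bg):
        labels[r][c] = 0
    for r, c in rng.sample(fg_pool, n_fg):
        labels[r][c] = 1
    return labels


if __name__ == "__main__":
    # Toy 8x8 map with a 3x3 high-activation square.
    toy = [[1.0 if 2 <= r < 5 and 2 <= c < 5 else 0.0 for c in range(8)]
           for r in range(8)]
    for row in sample_pseudo_pixels(toy, n_fg=2, n_bg=3, seed=0):
        print(row)
```

Calling the function again with a different seed gives a different set of labeled pixels for the same map, which is one way to read "on the fly" sampling; a partial loss would then be computed only on pixels whose label is not -1.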


Pages: 195-207

Number of pages: 13
