DINO-Tracker: Taming DINO for Self-supervised Point Tracking in a Single Video

Cited by: 0
Authors
Tumanyan, Narek [1]
Singer, Assaf [1]
Bagon, Shai [1]
Dekel, Tali [1]
Affiliations
[1] Weizmann Inst Sci, Rehovot, Israel
Source
Keywords
DOI
10.1007/978-3-031-73347-5_21
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present DINO-Tracker, a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adapts DINO's features to fit the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-Tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.
Pages: 367-385
Number of pages: 19