Vision-Language Tracking With CLIP and Interactive Prompt Learning

Cited by: 0
Authors:
Zhu, Hong [1 ,2 ]
Lu, Qingyang [3 ]
Xue, Lei [1 ]
Zhang, Pingping [4 ]
Yuan, Guanglin [3 ]
Affiliations:
[1] Natl Univ Def Technol, Coll Elect Engn, Hefei 230037, Peoples R China
[2] Army Artillery & Air Def Acad PLA, Anhui Key Lab Polarizat Imaging Detect Technol, Hefei 230031, Peoples R China
[3] Army Artillery & Air Def Acad PLA, Hefei 230031, Peoples R China
[4] Dalian Univ Technol, Sch Artificial Intelligence, Dalian 116024, Peoples R China
Keywords:
Feature extraction; Target tracking; Visualization; Foundation models; Linguistics; Semantics; Computational modeling; Tuning; Training; Encoding; Vision-language tracking; prompt learning; CLIP; layer-wise feature fusion
DOI
10.1109/TITS.2024.3520103
Chinese Library Classification (CLC):
TU [Building Science]
Discipline code:
0813
Abstract
Vision-language tracking is an emerging topic in intelligent transportation systems, with particular significance for autonomous driving and road surveillance. The task aims to combine the visual modality with an auxiliary linguistic modality to jointly localize the target object in a video sequence. Currently, multi-modal data scarcity and burdensome modality fusion are the two major factors limiting the development of vision-language tracking. To tackle these issues, we propose an efficient and effective one-stage vision-language tracking framework (CPIPTrack) that unifies feature extraction and multi-modal fusion through interactive prompt learning. Feature extraction is performed by the high-performance vision-language foundation model CLIP, so the tracker inherits the impressive generalization ability of the large-scale model. Modality fusion is simplified to a few lightweight prompts, leading to significant savings in computational resources. Specifically, we design three types of prompts to dynamically learn the layer-wise feature relationships between vision and language, facilitating rich context interaction while adapting the pre-trained CLIP. In this manner, discriminative target-oriented visual features can be extracted under language and template guidance and used for subsequent reasoning. Because extra heavy modality-fusion modules are eliminated, the proposed CPIPTrack is efficient in both training and inference. CPIPTrack has been extensively evaluated on three benchmark datasets, and the experimental results demonstrate that it achieves a good performance-speed balance, with an AUC of 66.0% on LaSOT and a runtime of 51.7 FPS on an RTX 2080 Super GPU.
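The abstract does not detail the paper's actual prompt design, so the following is only a minimal NumPy sketch of the general idea it describes: learnable prompt tokens are re-injected at every layer of a frozen backbone so that vision and language tokens interact cheaply, with the prompts as the only trainable parameters. All names, dimensions, and the mean-pooling "mixing" step below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8             # embedding dimension (illustrative)
N_V, N_L = 4, 3   # visual / language token counts (illustrative)
N_P = 2           # learnable prompt tokens per layer (illustrative)
LAYERS = 3

# Frozen backbone layers, stood in for here by fixed random projections.
frozen_W = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(LAYERS)]

# Layer-wise learnable prompts: the only "trainable" part in this scheme.
prompts = [rng.standard_normal((N_P, D)) * 0.02 for _ in range(LAYERS)]

def forward(vis, lang):
    """Run joint vision+language tokens through the frozen layers,
    injecting fresh prompts at each layer and discarding them after."""
    for W, p in zip(frozen_W, prompts):
        x = np.concatenate([vis, lang, p], axis=0)  # joint token sequence
        # Mean-pooled token mixing stands in for attention, so prompts
        # can influence vision and language features.
        x = np.tanh((x + x.mean(axis=0)) @ W)
        vis, lang = x[:N_V], x[N_V:N_V + N_L]       # drop this layer's prompts
    return vis, lang

vis0 = rng.standard_normal((N_V, D))
lang0 = rng.standard_normal((N_L, D))
v, l = forward(vis0, lang0)
print(v.shape, l.shape)  # (4, 8) (3, 8)
```

Because the backbone weights stay frozen and only the small per-layer prompt tensors would be optimized, training and inference costs stay close to those of the backbone alone, which is the efficiency argument the abstract makes.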
Pages: 3659-3670
Page count: 12