Vision-Language Tracking With CLIP and Interactive Prompt Learning

Cited by: 0
Authors:
Zhu, Hong [1 ,2 ]
Lu, Qingyang [3 ]
Xue, Lei [1 ]
Zhang, Pingping [4 ]
Yuan, Guanglin [3 ]
Affiliations:
[1] Natl Univ Def Technol, Coll Elect Engn, Hefei 230037, Peoples R China
[2] Army Artillery & Air Def Acad PLA, Anhui Key Lab Polarizat Imaging Detect Technol, Hefei 230031, Peoples R China
[3] Army Artillery & Air Def Acad PLA, Hefei 230031, Peoples R China
[4] Dalian Univ Technol, Sch Artificial Intelligence, Dalian 116024, Peoples R China
Keywords:
Feature extraction; Target tracking; Visualization; Foundation models; Linguistics; Semantics; Computational modeling; Tuning; Training; Encoding; Vision-language tracking; prompt learning; CLIP; layer-wise feature fusion
DOI
10.1109/TITS.2024.3520103
Chinese Library Classification (CLC):
TU [Building Science]
Discipline code:
0813
Abstract
Vision-language tracking is an emerging topic in intelligent transportation systems, with particular significance for autonomous driving and road surveillance. The task aims to combine the visual modality with an auxiliary linguistic modality to jointly localize the target object in a video sequence. Currently, multi-modal data scarcity and burdensome modality fusion are the two major factors limiting the development of vision-language tracking. To tackle these issues, we propose an efficient and effective one-stage vision-language tracking framework (CPIPTrack) that unifies feature extraction and multi-modal fusion through interactive prompt learning. Feature extraction is performed by the high-performance vision-language foundation model CLIP, so the tracker inherits the impressive generalization ability of the large-scale model. Modality fusion is simplified to a few lightweight prompts, leading to significant savings in computational resources. Specifically, we design three types of prompts to dynamically learn the layer-wise feature relationships between vision and language, facilitating rich context interaction while adapting the pre-trained CLIP. In this manner, discriminative target-oriented visual features can be extracted under language and template guidance and used for subsequent reasoning. Because extra heavy modality-fusion modules are eliminated, the proposed CPIPTrack is efficient in both training and inference. CPIPTrack has been extensively evaluated on three benchmark datasets, and the experimental results demonstrate that it achieves a good performance-speed balance, with an AUC of 66.0% on LaSOT and a runtime of 51.7 FPS on an RTX 2080 Super GPU.
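The abstract does not detail the paper's actual prompt design, so the following is only a minimal NumPy sketch of the general idea it describes: learnable prompt tokens are re-injected at every layer of a frozen backbone so that vision and language tokens interact cheaply, with the prompts as the only trainable parameters. All names, dimensions, and the mean-pooling "mixing" step below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8             # embedding dimension (illustrative)
N_V, N_L = 4, 3   # visual / language token counts (illustrative)
N_P = 2           # learnable prompt tokens per layer (illustrative)
LAYERS = 3

# Frozen backbone layers, stood in for here by fixed random projections.
frozen_W = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(LAYERS)]

# Layer-wise learnable prompts: the only "trainable" part in this scheme.
prompts = [rng.standard_normal((N_P, D)) * 0.02 for _ in range(LAYERS)]

def forward(vis, lang):
    """Run joint vision+language tokens through the frozen layers,
    injecting fresh prompts at each layer and discarding them after."""
    for W, p in zip(frozen_W, prompts):
        x = np.concatenate([vis, lang, p], axis=0)  # joint token sequence
        # Mean-pooled token mixing stands in for attention, so prompts
        # can influence vision and language features.
        x = np.tanh((x + x.mean(axis=0)) @ W)
        vis, lang = x[:N_V], x[N_V:N_V + N_L]       # drop this layer's prompts
    return vis, lang

vis0 = rng.standard_normal((N_V, D))
lang0 = rng.standard_normal((N_L, D))
v, l = forward(vis0, lang0)
print(v.shape, l.shape)  # (4, 8) (3, 8)
```

Because the backbone weights stay frozen and only the small per-layer prompt tensors would be optimized, training and inference costs stay close to those of the backbone alone, which is the efficiency argument the abstract makes.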
Pages: 3659-3670
Page count: 12