Enhanced RGBT Tracking Network With Semantic Generation and Historical Context

Cited by: 0

Authors:

Gao, Zhao ^{[1]}

Zhou, Dongming ^{[1,2]}

Cao, Jinde ^{[3,4]}

Liu, Yisong ^{[1]}

Shan, Qingqing ^{[1]}

Affiliations:

[1] Yunnan Univ, School of Informat Sci & Engn, Kunming 650091, Peoples R China

[2] Hunan Univ Informat Technol, School of Elect Sci & Engn, Changsha 410100, Peoples R China

[3] Southeast Univ, Sch Math, Nanjing 211189, Peoples R China

[4] Purple Mt Labs, Nanjing 211111, Peoples R China

Source:

IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT | 2025 / Vol. 74

Funding:

National Natural Science Foundation of China;

Keywords:

Visualization; Target tracking; Semantics; Feature extraction; Encoding; Decoding; Accuracy; Imaging; Linguistics; Data mining; Historical prompt; linguistic information; prompt learning; RGB-thermal (RGBT) tracking; MODEL;

DOI:

10.1109/TIM.2025.3551143

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

Multimodal tracking is a crucial visual task that focuses on accurately locating specific targets in video frames. The primary challenge lies in effectively utilizing visual features to identify relevant positions. Existing methods often rely on advanced visual encoders and decoders to extract features from both visible and other modalities. However, due to the limited availability of multimodal data, relying solely on visual information is insufficient. Inspired by vision-language models, we propose the RGB-thermal (RGBT) tracking network with semantic generation and historical context (SHT). This approach addresses the lack of linguistic information in visual tracking and explores the semantic relationships between the target and its search area. Our approach utilizes large models to generate image descriptions, enhancing the target's appearance information. Furthermore, it introduces the detail text visual focus (DTVF) module to improve the consistency between visual and textual data. In addition, we present a historical prompt generation method that combines historical foreground masks with visual features to provide precise cues for tracking purposes. The experimental results show that incorporating image descriptions and historical information significantly enhances multimodal tracking performance.

Pages: 17
