Multi-modal interaction with token division strategy for RGB-T tracking

Cited by: 30
Authors
Cai, Yujue [1 ]
Sui, Xiubao [1 ]
Gu, Guohua [1 ]
Chen, Qian [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Elect & Opt Engn, Nanjing 210014, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGB-T tracking; Multi-modal fusion; Vision transformer; Cross-modal interaction; Attention masking strategy; NETWORK; FUSION;
DOI
10.1016/j.patcog.2024.110626
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
RGB-T tracking takes visible and infrared images as inputs and is an extended application of multi-modal fusion in the field of visual object tracking. The complementarity between the visible and infrared modalities can enhance the robustness of a tracker in complex scenes. Cross-modal interaction can facilitate the fusion and synergy of different modalities, but most previous methods lack clear target information in multi-modal fusion, leading to undesired cross-relations during interaction. To reduce these undesired cross-relations, we propose a Multi-modal Interaction scheme Guided by Token Division strategy (MIGTD). This scheme divides the input multi-modal tokens into several categories and restricts the interaction between tokens by setting different rules. These restrictions are applied in parallel through an attention masking strategy. To accurately classify search tokens, an instance segmentation task with a box-supervised loss is employed. We conduct extensive experiments on three popular benchmark datasets: RGBT234, LasHeR and VTUAV. The experimental results indicate that the proposed tracker achieves state-of-the-art performance.
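The abstract describes restricting interaction between token categories via an attention masking strategy. Below is a minimal sketch of how such category-based masking could be implemented in PyTorch; the category taxonomy, the rule table, and the function names are illustrative assumptions, not the paper's exact design.

```python
import torch

# Assumed token categories for illustration only (not the paper's taxonomy):
# 0 = RGB template, 1 = TIR template, 2 = RGB search (target),
# 3 = TIR search (target), 4 = RGB search (background), 5 = TIR search (background)
NUM_CATEGORIES = 6

def build_attention_mask(token_categories: torch.Tensor,
                         allowed: torch.Tensor) -> torch.Tensor:
    """token_categories: (B, N) integer category id per token.
    allowed: (C, C) boolean table; allowed[i, j] = True means a token of
    category i may attend to a token of category j.
    Returns an additive mask of shape (B, 1, N, N)."""
    q_cat = token_categories.unsqueeze(-1)        # (B, N, 1) query categories
    k_cat = token_categories.unsqueeze(-2)        # (B, 1, N) key categories
    permit = allowed[q_cat, k_cat]                # (B, N, N) allowed pairs
    mask = torch.zeros(permit.shape, dtype=torch.float)
    mask = mask.masked_fill(~permit, float('-inf'))
    return mask.unsqueeze(1)                      # broadcast over attention heads

def masked_attention(q, k, v, mask):
    """q, k, v: (B, H, N, D); mask: (B, 1, N, N) additive mask."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    attn = (scores + mask).softmax(dim=-1)        # disallowed pairs get zero weight
    return attn @ v

# Example rule table (assumption): background search tokens may not attend
# to the template tokens of the other modality; all other pairs interact freely.
allowed = torch.ones(NUM_CATEGORIES, NUM_CATEGORIES, dtype=torch.bool)
allowed[4, 1] = False   # RGB background search -> TIR template blocked
allowed[5, 0] = False   # TIR background search -> RGB template blocked
```

Because the rules are expressed as one additive mask, all category-specific restrictions are enforced in a single parallel attention pass rather than by running separate attention branches per rule.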
Pages: 11