Transformer-based visual object tracking via fine-coarse concatenated attention and cross concatenated MLP

Cited by: 21
Authors
Gao, Long [1 ]
Chen, Langkun [1 ]
Liu, Pan [1 ]
Jiang, Yan [2 ]
Li, Yunsong [1 ]
Ning, Jifeng [3 ]
Affiliations
[1] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[2] Univ Sheffield, Dept Elect & Elect Engn, Sheffield S10 2TN, England
[3] Northwest A&F Univ, Coll Informat Engn, Yangling 712100, Peoples R China
Keywords
Visual object tracking; Transformer; Fine-coarse concatenated attention; Multi-layer perceptron; Siamese network;
DOI
10.1016/j.patcog.2023.109964
CLC classification
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Transformer-based trackers have demonstrated promising performance on visual object tracking tasks. Nevertheless, two drawbacks limit their potential performance. First, the static receptive field of the tokens within one attention layer of the original self-attention neglects the multi-scale nature of the object tracking task. Second, the learning procedure of the multi-layer perceptron (MLP) in the feed-forward network (FFN) lacks local interaction information among samples. To address these issues, a new self-attention learning method, fine-coarse concatenated attention (FCA), is proposed to learn self-attention with both fine- and coarse-granularity information. Moreover, a cross-concatenated MLP (CC-MLP) is developed to capture local interaction information across samples. Based on the two proposed modules, a novel encoder and decoder are constructed and integrated into an all-attention tracking algorithm, FCAT. Comprehensive experiments on the popular tracking datasets OTB2015, LaSOT, GOT-10K, and TrackingNet verify the effectiveness of FCA and CC-MLP, and FCAT achieves state-of-the-art performance on these datasets.
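The two ideas named in the abstract can be sketched in a few lines. The following NumPy code is an illustrative sketch only, not the authors' implementation: it assumes FCA lets queries attend jointly to fine-granularity tokens and average-pooled coarse copies of the same tokens (giving one attention layer a multi-scale receptive field), and it assumes CC-MLP injects local interaction by concatenating each token's features with a shifted neighbour's before the FFN. The function names, the pooling stride, and the neighbour-shift scheme are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool_tokens(x, stride):
    """Coarsen a token sequence (N, d) by average-pooling groups of `stride`."""
    n, d = x.shape
    n_trim = (n // stride) * stride
    return x[:n_trim].reshape(n_trim // stride, stride, d).mean(axis=1)

def fine_coarse_attention(q, kv, stride=2):
    """Sketch of FCA: queries attend to [fine kv ; pooled coarse kv] jointly."""
    coarse = avg_pool_tokens(kv, stride)
    keys = np.concatenate([kv, coarse], axis=0)   # fine + coarse granularity
    attn = softmax(q @ keys.T / np.sqrt(q.shape[-1]))
    return attn @ keys

def cc_mlp(x, w1, w2):
    """Sketch of CC-MLP: concatenate each token with a shifted neighbour
    before the two FFN projections, adding local interaction information."""
    shifted = np.roll(x, 1, axis=0)               # neighbouring token features
    h = np.maximum(np.concatenate([x, shifted], axis=-1) @ w1, 0.0)  # ReLU
    return h @ w2

rng = np.random.default_rng(0)
q, kv = rng.standard_normal((4, 8)), rng.standard_normal((6, 8))
out = fine_coarse_attention(q, kv)               # (4, 8)
w1, w2 = rng.standard_normal((16, 32)), rng.standard_normal((32, 8))
mixed = cc_mlp(kv, w1, w2)                       # (6, 8)
```

With 6 fine tokens and stride 2, each query attends over 9 keys (6 fine + 3 coarse) in a single layer, which is the multi-scale effect the abstract attributes to FCA.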
Pages: 10