Transformer-based visual object tracking via fine-coarse concatenated attention and cross concatenated MLP

Cited by: 21
Authors
Gao, Long [1 ]
Chen, Langkun [1 ]
Liu, Pan [1 ]
Jiang, Yan [2 ]
Li, Yunsong [1 ]
Ning, Jifeng [3 ]
Affiliations
[1] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[2] Univ Sheffield, Dept Elect & Elect Engn, Sheffield S10 2TN, England
[3] Northwest A&F Univ, Coll Informat Engn, Yangling 712100, Peoples R China
Keywords
Visual object tracking; Transformer; Fine-coarse concatenated attention; Multi-layer perceptron; Siamese network;
DOI
10.1016/j.patcog.2023.109964
CLC classification
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Transformer-based trackers have demonstrated promising performance on visual object tracking tasks. Nevertheless, two drawbacks limit their potential performance. First, the static receptive field of the tokens within one attention layer of the original self-attention neglects the multi-scale nature of the object tracking task. Second, the learning procedure of the multi-layer perceptron (MLP) in the feed-forward network (FFN) lacks local interaction information among samples. To address these issues, a new self-attention learning method, fine-coarse concatenated attention (FCA), is proposed to learn self-attention with both fine- and coarse-granularity information. Moreover, a cross-concatenated MLP (CC-MLP) is developed to capture local interaction information across samples. Based on the two proposed modules, a novel encoder and decoder are constructed and integrated into an all-attention tracking algorithm, FCAT. Comprehensive experiments on the popular tracking datasets OTB2015, LaSOT, GOT-10K, and TrackingNet verify the effectiveness of FCA and CC-MLP, and FCAT achieves state-of-the-art performance on these datasets.
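The two ideas named in the abstract can be sketched in a few lines. The following NumPy code is an illustrative sketch only, not the authors' implementation: it assumes FCA lets queries attend jointly to fine-granularity tokens and average-pooled coarse copies of the same tokens (giving one attention layer a multi-scale receptive field), and it assumes CC-MLP injects local interaction by concatenating each token's features with a shifted neighbour's before the FFN. The function names, the pooling stride, and the neighbour-shift scheme are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool_tokens(x, stride):
    """Coarsen a token sequence (N, d) by average-pooling groups of `stride`."""
    n, d = x.shape
    n_trim = (n // stride) * stride
    return x[:n_trim].reshape(n_trim // stride, stride, d).mean(axis=1)

def fine_coarse_attention(q, kv, stride=2):
    """Sketch of FCA: queries attend to [fine kv ; pooled coarse kv] jointly."""
    coarse = avg_pool_tokens(kv, stride)
    keys = np.concatenate([kv, coarse], axis=0)   # fine + coarse granularity
    attn = softmax(q @ keys.T / np.sqrt(q.shape[-1]))
    return attn @ keys

def cc_mlp(x, w1, w2):
    """Sketch of CC-MLP: concatenate each token with a shifted neighbour
    before the two FFN projections, adding local interaction information."""
    shifted = np.roll(x, 1, axis=0)               # neighbouring token features
    h = np.maximum(np.concatenate([x, shifted], axis=-1) @ w1, 0.0)  # ReLU
    return h @ w2

rng = np.random.default_rng(0)
q, kv = rng.standard_normal((4, 8)), rng.standard_normal((6, 8))
out = fine_coarse_attention(q, kv)               # (4, 8)
w1, w2 = rng.standard_normal((16, 32)), rng.standard_normal((32, 8))
mixed = cc_mlp(kv, w1, w2)                       # (6, 8)
```

With 6 fine tokens and stride 2, each query attends over 9 keys (6 fine + 3 coarse) in a single layer, which is the multi-scale effect the abstract attributes to FCA.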
Pages: 10