Robust Visual Tracking Using Hierarchical Vision Transformer with Shifted Windows Multi-Head Self-Attention

Cited by: 1

Authors:

Gao, Peng [1]

Zhang, Xin-Yue [1]

Yang, Xiao-Li [1]

Ni, Jian-Cheng [1]

Wang, Fei [2]

Affiliations:

[1] Qufu Normal Univ, Sch Cyber Sci & Engn, Qufu 273165, Shandong, Peoples R China

[2] Harbin Inst Technol, Sch Elect & Informat Engineering, Shenzhen, Guangdong, Peoples R China

Source:

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS | 2024 / Vol. E107D / No. 01

Funding:

China Postdoctoral Science Foundation

Keywords:

Siamese network; visual tracking; vision transformer; self-attention

DOI:

10.1587/transinf.2023EDL8053

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Although Siamese trackers have attracted much attention in recent years due to their scalability and efficiency, researchers have ignored the background appearance, which makes these trackers ill-suited to recognizing arbitrary target objects under various variations, especially in complex scenarios with background clutter and distractors. In this paper, we present a simple yet effective Siamese tracker, where shifted windows multi-head self-attention is employed to learn the characteristics of a specific given target object for visual tracking. To validate the effectiveness of our proposed tracker, we use the Swin Transformer as the backbone network and introduce an auxiliary feature enhancement network. Extensive experimental results on two evaluation datasets demonstrate that the proposed tracker outperforms other baselines.
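The abstract names shifted windows multi-head self-attention without describing it. As a rough illustration only, the sketch below shows the generic window-partitioning idea behind the Swin Transformer backbone: tokens are grouped into non-overlapping windows, and an alternate layer cyclically shifts the grid first so that the new windows straddle the old window borders. This is not the authors' implementation. The function names `cyclic_shift` and `window_partition`, the 4x4 toy grid, and the window and shift sizes are all assumptions made for this example, and the attention computation itself (including the masking that a real Swin layer applies to wrapped-around tokens) is omitted.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D grid up and to the left by `shift` cells, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def window_partition(grid, window):
    """Split a 2D grid into non-overlapping window x window blocks.

    Windows are returned in row-major order; each window is a flat list of
    the tokens that would attend to one another inside that window.
    """
    h, w = len(grid), len(grid[0])
    assert h % window == 0 and w % window == 0, "grid must tile evenly"
    return [
        [grid[r][c]
         for r in range(top, top + window)
         for c in range(left, left + window)]
        for top in range(0, h, window)
        for left in range(0, w, window)
    ]


# Toy 4x4 "feature map" whose entries are simply token indices 0..15
# (hypothetical data, chosen so the grouping is easy to read).
tokens = [[4 * r + c for c in range(4)] for r in range(4)]

# Regular layer: plain 2x2 windows.
regular = window_partition(tokens, window=2)

# Shifted layer: shift by half a window, then partition again.
shifted = window_partition(cyclic_shift(tokens, shift=1), window=2)

print(regular[0])  # [0, 1, 4, 5]
print(shifted[0])  # [5, 6, 9, 10]
```

In the regular partition, tokens 5, 6, 9 and 10 fall into four different windows; after the shift they share one window, which is how alternating the two layouts lets information flow across window borders while each attention step stays local.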

Pages: 161-164

Page count: 4

