MixFormer: End-to-End Tracking with Iterative Mixed Attention

Cited by: 347
Authors
Cui, Yutao [1 ]
Jiang, Cheng [1 ]
Wang, Limin [1 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Jiangsu, Peoples R China
Source
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2022
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1109/CVPR52688.2022.01324
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and to perform extensive communication between the target and the search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer sets new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT, 88.9% on TrackingNet, and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
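The abstract describes the architecture only at a high level. As a rough illustration of the mixed-attention idea it names (one attention operation over concatenated template and search tokens, so feature extraction and target-search information integration happen in the same layer, with an asymmetric mode that cuts cost), here is a minimal PyTorch sketch. It is not the authors' implementation: the module name, dimensions, and the exact form of the asymmetry (template queries attending only to template keys) are illustrative assumptions; the official code is at https://github.com/MCG-NJU/MixFormer.

```python
# Minimal sketch (NOT the authors' code) of a "mixed attention" layer:
# template and search tokens are concatenated and attended to jointly,
# so feature extraction and target-search integration happen together.
# The asymmetric variant assumes template queries attend only to
# template keys; names and sizes below are illustrative.
import torch
import torch.nn as nn


class MixedAttention(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4, asymmetric: bool = True):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.asymmetric = asymmetric
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _attend(self, q, k, v):
        # q, k, v: (B, heads, N, head_dim) -> (B, heads, Nq, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        # template: (B, Nt, dim) target tokens; search: (B, Ns, dim) search-area tokens
        B, Nt, _ = template.shape
        Ns = search.shape[1]
        x = torch.cat([template, search], dim=1)                    # (B, Nt+Ns, dim)
        qkv = self.qkv(x).reshape(B, Nt + Ns, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                        # each (B, H, Nt+Ns, hd)

        q_t, q_s = q[:, :, :Nt], q[:, :, Nt:]
        k_t, v_t = k[:, :, :Nt], v[:, :, :Nt]

        if self.asymmetric:
            # Template queries see only template keys (cheaper); search
            # queries still attend to both template and search tokens.
            out_t = self._attend(q_t, k_t, v_t)
            out_s = self._attend(q_s, k, v)
        else:
            out_t = self._attend(q_t, k, v)
            out_s = self._attend(q_s, k, v)

        out = torch.cat([out_t, out_s], dim=2)                      # (B, H, Nt+Ns, hd)
        out = out.transpose(1, 2).reshape(B, Nt + Ns, -1)
        out = self.proj(out)
        return out[:, :Nt], out[:, Nt:]                             # split back into streams


if __name__ == "__main__":
    mam = MixedAttention(dim=64, num_heads=4)
    t = torch.randn(2, 49, 64)     # e.g. 7x7 template tokens (illustrative size)
    s = torch.randn(2, 324, 64)    # e.g. 18x18 search tokens (illustrative size)
    t_out, s_out = mam(t, s)
    print(t_out.shape, s_out.shape)  # (2, 49, 64) and (2, 324, 64)
```

A full tracker along the lines sketched in the abstract would stack several such layers with progressive patch embedding and place a localization head on the search-token output.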
Pages: 13598 - 13608
Number of pages: 11