Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers

Cited by: 21
Authors
Zhu, Tianyu [1]
Hiller, Markus [2]
Ehsanpour, Mahsa [3]
Ma, Rongkai [1]
Drummond, Tom [2]
Reid, Ian
Rezatofighi, Hamid [4]
Affiliations
[1] Monash Univ, Dept Elect & Comp Syst Engn, Clayton, Vic 3800, Australia
[2] Univ Melbourne, Sch Comp & Informat Syst, Parkville, Vic 3010, Australia
[3] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA 5005, Australia
[4] Monash Univ, Dept Data Sci & AI, Clayton, Vic 3800, Australia
Keywords
Tracking; Transformers; Task analysis; Visualization; Object recognition; History; Feature extraction; Multi-object tracking; transformer; spatio-temporal model; pedestrian tracking; end-to-end learning; MULTITARGET;
DOI
10.1109/TPAMI.2022.3213073
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Tracking a time-varying indefinite number of objects in a video sequence over time remains a challenge despite recent advances in the field. Most existing approaches are not able to properly handle multi-object tracking challenges such as occlusion, in part because they ignore long-term temporal information. To address these shortcomings, we present MO3TR: a truly end-to-end Transformer-based online multi-object tracking (MOT) framework that learns to handle occlusions, track initiation and termination without the need for an explicit data association module or any heuristics. MO3TR encodes object interactions into long-term temporal embeddings using a combination of spatial and temporal Transformers, and recursively uses the information jointly with the input data to estimate the states of all tracked objects over time. The spatial attention mechanism enables our framework to learn implicit representations between all the objects and the objects to the measurements, while the temporal attention mechanism focuses on specific parts of past information, allowing our approach to resolve occlusions over multiple frames. Our experiments demonstrate the potential of this new approach, achieving results on par with or better than the current state-of-the-art on multiple MOT metrics for several popular multi-object tracking benchmarks.
Pages: 12783-12797
Number of pages: 15