DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions

Cited by: 0
Authors
Fu, Teng [1]
Wang, Xiaocong [1]
Yu, Haiyang [1]
Niu, Ke [1]
Li, Bin [1]
Xue, Xiangyang [1]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab IIP, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
Multiple object tracking; Transformer; Occlusion handling; Set prediction;
DOI
10.1145/3581783.3611728
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multiple object tracking (MOT) becomes considerably more challenging when severe occlusions occur. In this paper, we analyze the limitations of traditional Convolutional Neural Network-based methods and Transformer-based methods in handling occlusions and propose DNMOT, an end-to-end trainable DeNoising Transformer for MOT. To address the challenge of occlusions, we explicitly simulate the scenarios in which occlusions occur. Specifically, we augment trajectories with noise during training and make our model learn the denoising process in an encoder-decoder architecture, so that the model exhibits strong robustness and performs well in crowded scenes. Additionally, we propose a Cascaded Mask strategy to better coordinate the interaction between different types of queries in the decoder and to prevent mutual suppression between neighboring trajectories in crowded scenes. Notably, the proposed method requires no additional modules such as a matching strategy or motion state estimation at inference time. We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
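To make the trajectory-noising idea in the abstract concrete, the sketch below jitters ground-truth track boxes into noised queries that a denoising decoder would be trained to map back to the clean boxes. This is only an illustration of the general technique, not the authors' released implementation: the noise model, the parameter names center_noise and scale_noise, and the normalized (cx, cy, w, h) box format are assumptions.

    import numpy as np

    def add_box_noise(boxes, center_noise=0.1, scale_noise=0.2, rng=None):
        """Jitter track boxes (normalized cx, cy, w, h) to build noised
        denoising queries; noise scheme is an illustrative assumption."""
        rng = rng if rng is not None else np.random.default_rng()
        boxes = np.asarray(boxes, dtype=np.float64)
        cxcy, wh = boxes[:, :2], boxes[:, 2:]
        # Shift each center by up to center_noise of the box size per axis.
        cxcy_noised = cxcy + rng.uniform(-1, 1, cxcy.shape) * center_noise * wh
        # Rescale width/height by a factor in [1 - scale_noise, 1 + scale_noise].
        wh_noised = wh * (1.0 + rng.uniform(-1, 1, wh.shape) * scale_noise)
        noised = np.concatenate([cxcy_noised, wh_noised], axis=1)
        return np.clip(noised, 0.0, 1.0)

    # Two overlapping track boxes from the previous frame, jittered to imitate
    # the drift that occlusion causes; training would supervise the decoder to
    # recover the original boxes from these noised queries.
    tracks = np.array([[0.40, 0.50, 0.10, 0.30],
                       [0.45, 0.52, 0.12, 0.28]])
    print(add_box_noise(tracks, rng=np.random.default_rng(0)))

Because the noised queries of nearby tracks can land on top of each other, a masking scheme such as the paper's Cascaded Mask strategy is needed in the decoder so that different query types do not suppress one another during attention.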
Pages: 2734 - 2743
Page count: 10