DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions

Cited by: 0
Authors
Fu, Teng [1]
Wang, Xiaocong [1]
Yu, Haiyang [1]
Niu, Ke [1]
Li, Bin [1]
Xue, Xiangyang [1]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab IIP, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
Multiple object tracking; Transformer; Occlusion handling; Set prediction;
DOI
10.1145/3581783.3611728
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multiple object tracking (MOT) becomes considerably more challenging when severe occlusions occur. In this paper, we analyze the limitations of traditional Convolutional Neural Network-based methods and Transformer-based methods in handling occlusions and propose DNMOT, an end-to-end trainable DeNoising Transformer for MOT. To address the challenge of occlusions, we explicitly simulate the scenarios in which occlusions occur. Specifically, we augment trajectories with noise during training and make our model learn the denoising process in an encoder-decoder architecture, so that the model exhibits strong robustness and performs well in crowded scenes. Additionally, we propose a Cascaded Mask strategy to better coordinate the interaction between different types of queries in the decoder and to prevent mutual suppression between neighboring trajectories in crowded scenes. Notably, the proposed method requires no additional modules such as a matching strategy or motion state estimation at inference time. We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
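To make the trajectory-noising idea in the abstract concrete, the sketch below jitters ground-truth track boxes into noised queries that a denoising decoder would be trained to map back to the clean boxes. This is only an illustration of the general technique, not the authors' released implementation: the noise model, the parameter names center_noise and scale_noise, and the normalized (cx, cy, w, h) box format are assumptions.

    import numpy as np

    def add_box_noise(boxes, center_noise=0.1, scale_noise=0.2, rng=None):
        """Jitter track boxes (normalized cx, cy, w, h) to build noised
        denoising queries; noise scheme is an illustrative assumption."""
        rng = rng if rng is not None else np.random.default_rng()
        boxes = np.asarray(boxes, dtype=np.float64)
        cxcy, wh = boxes[:, :2], boxes[:, 2:]
        # Shift each center by up to center_noise of the box size per axis.
        cxcy_noised = cxcy + rng.uniform(-1, 1, cxcy.shape) * center_noise * wh
        # Rescale width/height by a factor in [1 - scale_noise, 1 + scale_noise].
        wh_noised = wh * (1.0 + rng.uniform(-1, 1, wh.shape) * scale_noise)
        noised = np.concatenate([cxcy_noised, wh_noised], axis=1)
        return np.clip(noised, 0.0, 1.0)

    # Two overlapping track boxes from the previous frame, jittered to imitate
    # the drift that occlusion causes; training would supervise the decoder to
    # recover the original boxes from these noised queries.
    tracks = np.array([[0.40, 0.50, 0.10, 0.30],
                       [0.45, 0.52, 0.12, 0.28]])
    print(add_box_noise(tracks, rng=np.random.default_rng(0)))

Because the noised queries of nearby tracks can land on top of each other, a masking scheme such as the paper's Cascaded Mask strategy is needed in the decoder so that different query types do not suppress one another during attention.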
Pages: 2734 - 2743
Page count: 10