OffsetNet: Towards Efficient Multiple Object Tracking, Detection, and Segmentation

Cited by: 1
Authors
Zhang, Wei [1]
Li, Jiaming [2]
Xia, Meng [2]
Gao, Xu [1]
Tan, Xiao [1]
Shi, Yifeng [1]
Huang, Zhenhua [3]
Li, Guanbin [2,4]
Affiliations
[1] Baidu Inc, Beijing 100085, Peoples R China
[2] Sun Yat Sen Univ, Res Inst, Sch Comp Sci & Engn, Guangzhou 510006, Peoples R China
[3] South China Normal Univ, Sch Comp Sci, Guangzhou 510631, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Decoding; Feature extraction; Instance segmentation; Semantics; Real-time systems; Multitasking; Three-dimensional displays; Pipelines; Streaming media; Object tracking; Multi-object tracking; Object detection; Object segmentation
DOI
10.1109/TPAMI.2024.3485644
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Offset-based representation has emerged as a promising approach for modeling semantic relations between pixels and object motion, demonstrating efficacy across various computer vision tasks. In this paper, we introduce a novel one-stage multi-tasking network that extends the offset-based approach to multi-object tracking and segmentation (MOTS). Our proposed framework, named OffsetNet, is designed to concurrently address amodal bounding box detection, instance segmentation, and tracking. It achieves this by formulating all three tasks within a unified pixel-offset-based representation, thereby achieving excellent efficiency and encouraging mutual collaboration among the tasks. OffsetNet has several notable properties: first, the encoder is empowered by a novel Memory Enhanced Linear Self-Attention (MELSA) block to efficiently aggregate spatial-temporal features; second, the three tasks are decoupled fairly by three lightweight decoders that operate in a one-shot manner; third, a novel cross-frame offset prediction module enhances the robustness of tracking against occlusions. With these merits, OffsetNet achieves 76.83% HOTA on the KITTI MOTS benchmark, the best result that does not rely on 3D detection. Furthermore, it achieves 74.83% HOTA at 50 FPS on the KITTI MOT benchmark, nearly 3.3 times faster than CenterTrack with better performance. We hope our approach will serve as a solid baseline and encourage future research in this field.
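The unified pixel-offset representation described in the abstract can be illustrated with a generic center-voting sketch. This is not the paper's actual implementation; the function name, array shapes, and toy values below are all illustrative assumptions. The idea: each foreground pixel predicts an offset toward its instance center, and pixels are grouped by whichever center their shifted position lands nearest to.

```python
import numpy as np

def group_pixels_by_offset(offsets, centers, mask):
    """Illustrative sketch: assign each foreground pixel to an instance by
    shifting it with its predicted (dy, dx) offset and picking the nearest
    instance center. offsets: (H, W, 2), centers: (C, 2), mask: (H, W) bool."""
    ys, xs = np.nonzero(mask)                                  # foreground pixels, (P,)
    shifted = np.stack([ys, xs], axis=1) + offsets[ys, xs]     # shifted coords, (P, 2)
    dists = np.linalg.norm(shifted[:, None, :] - centers[None, :, :], axis=2)  # (P, C)
    labels = dists.argmin(axis=1)                              # nearest center per pixel
    inst = np.full(mask.shape, -1, dtype=int)                  # -1 marks background
    inst[ys, xs] = labels
    return inst

# Toy example: two foreground pixels, each voting for its own center.
mask = np.zeros((5, 5), dtype=bool)
mask[1, 2] = mask[3, 3] = True
offsets = np.zeros((5, 5, 2))
offsets[1, 2] = [0.0, -1.0]        # pixel (1, 2) points at center (1, 1)
centers = np.array([[1.0, 1.0], [3.0, 3.0]])
inst = group_pixels_by_offset(offsets, centers, mask)
print(inst[1, 2], inst[3, 3])      # 0 1
```

In the paper this grouping is learned end to end and shared across detection, segmentation, and tracking (with cross-frame offsets linking instances over time); the sketch only shows the geometric grouping step.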
Pages: 949-960
Page count: 12