Video Mask Transfiner for High-Quality Video Instance Segmentation

Cited by: 6
Authors
Ke, Lei [1, 2]
Ding, Henghui [1]
Danelljan, Martin [1]
Tai, Yu-Wing [3]
Tang, Chi-Keung [2]
Yu, Fisher [1]
Affiliations
[1] Swiss Fed Inst Technol, Comp Vis Lab, Zurich, Switzerland
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[3] Kuaishou Technol, Beijing, Peoples R China
Source
COMPUTER VISION - ECCV 2022, PT XXVIII | 2022 / Vol. 13688
Keywords
Video instance segmentation; Multiple object tracking and segmentation; Video mask transfiner; Iterative training; Self-correction;
DOI
10.1007/978-3-031-19815-1_42
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. Moreover, the predicted segmentations often fluctuate over time, suggesting that temporal consistency cues are neglected or not fully utilized. In this paper, we set out to tackle these issues, with the aim of achieving highly detailed and more temporally stable mask predictions for VIS. We first propose the Video Mask Transfiner (VMT) method, capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure. Our VMT detects and groups sparse error-prone spatio-temporal regions of each tracklet in the video segment, which are then refined using both local and instance-level cues. Second, we identify that the coarse boundary annotations of the popular YouTube-VIS dataset constitute a major limiting factor. Based on our VMT architecture, we therefore design an automated annotation refinement approach by iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset, consisting of a manually re-annotated test set and our automatically refined training data. We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, as well as the YouTube-VIS, OVIS and BDD100K MOTS benchmarks. Experimental results clearly demonstrate the efficacy and effectiveness of our method on segmenting complex and dynamic objects, by capturing precise details.
Pages: 731-747
Page count: 17
Related papers
43 in total
[1]   Video Based Reconstruction of 3D People Models [J].
Alldieck, Thiemo ;
Magnor, Marcus ;
Xu, Weipeng ;
Theobalt, Christian ;
Pons-Moll, Gerard .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :8387-8397
[2]   STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos [J].
Athar, Ali ;
Mahadevan, Sabarinath ;
Osep, Aljosa ;
Leal-Taixe, Laura ;
Leibe, Bastian .
COMPUTER VISION - ECCV 2020, PT XI, 2020, 12356 :158-177
[3]   Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation [J].
Bertasius, Gedas ;
Torresani, Lorenzo .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :9736-9745
[4]  
Bolya Daniel, 2019, ICCV
[5]   SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation [J].
Cao, Jiale ;
Anwer, Rao Muhammad ;
Cholakkal, Hisham ;
Khan, Fahad Shahbaz ;
Pang, Yanwei ;
Shao, Ling .
COMPUTER VISION - ECCV 2020, PT XIV, 2020, 12359 :1-18
[6]   End-to-End Object Detection with Transformers [J].
Carion, Nicolas ;
Massa, Francisco ;
Synnaeve, Gabriel ;
Usunier, Nicolas ;
Kirillov, Alexander ;
Zagoruyko, Sergey .
COMPUTER VISION - ECCV 2020, PT I, 2020, 12346 :213-229
[7]   Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation [J].
Chen, Liang-Chieh ;
Lopes, Raphael Gontijo ;
Cheng, Bowen ;
Collins, Maxwell D. ;
Cubuk, Ekin D. ;
Zoph, Barret ;
Adam, Hartwig ;
Shlens, Jonathon .
COMPUTER VISION - ECCV 2020, PT IX, 2020, 12354 :695-714
[8]  
Cheng B., 2021, CVPR
[9]   CascadePSP: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement [J].
Cheng, Ho Kei ;
Chung, Jihoon ;
Tai, Yu-Wing ;
Tang, Chi-Keung .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :8887-8896
[10]   Boundary-Aware Feature Propagation for Scene Segmentation [J].
Ding, Henghui ;
Jiang, Xudong ;
Liu, Ai Qun ;
Thalmann, Nadia Magnenat ;
Wang, Gang .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :6818-6828