Occluded Video Instance Segmentation: A Benchmark

Cited by: 0
Authors
Jiyang Qi
Yan Gao
Yao Hu
Xinggang Wang
Xiaoyu Liu
Xiang Bai
Serge Belongie
Alan Yuille
Philip H. S. Torr
Song Bai
Affiliations
[1] Huazhong University of Science and Technology
[2] Alibaba Group
[3] University of Copenhagen
[4] Johns Hopkins University
[5] University of Oxford
Source
International Journal of Computer Vision | 2022 / Vol. 130
Keywords
Video instance segmentation; Occlusion reasoning; Dataset; Video understanding; Benchmark; 68T07; 68T45
DOI
Not available
Abstract
Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain a remarkable AP improvement on the OVIS dataset. The OVIS dataset and project code are available at http://songbai.site/ovis.
Pages: 2022-2039
Number of pages: 17