Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework With Spatio-Temporal Collaboration

Times cited: 33
Authors
Yan, Liqi [1, 2]
Wang, Qifan [3]
Ma, Siqi [2]
Wang, Jingang [4]
Yu, Changbin [5, 6]
Affiliations
[1] Fudan Univ, Westlake Inst Adv Study, Shanghai 200437, Peoples R China
[2] Westlake Univ, Sch Engn, Hangzhou 310024, Peoples R China
[3] Meta AI, Menlo Pk, CA 94025 USA
[4] Meituan, Beijing 100102, Peoples R China
[5] Shandong First Med Univ & Shandong Acad Med Sci, Coll Artificial Intelligence & Big Data Med Sci, Jinan 250021, Peoples R China
[6] Fudan Univ, Inst Intelligent Robots, Shanghai 200437, Peoples R China
Funding
U.S. National Science Foundation
Keywords
Video instance segmentation; weakly supervised learning; multi-object tracking and segmentation; object segmentation; image
DOI
10.1109/TCSVT.2022.3202574
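Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract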
Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos, namely STC-Seg. Concretely, STC-Seg demonstrates four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance the mask generation, we devise a puzzle loss, which enables end-to-end training using box-level annotations. Third, our tracking module jointly utilizes bounding-box diagonal points with spatio-temporal discrepancy to model movements, which largely improves the robustness to different object appearances. Finally, our framework is flexible and enables image-level instance segmentation methods to operate the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it reflects the tip of an iceberg about the innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.
Pages: 393-406
Page count: 14
Related papers
50 in total
  • [1] Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency
    Liu, Qing
    Ramanathan, Vignesh
    Mahajan, Dhruv
    Yuille, Alan
    Yang, Zhenheng
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 13963 - 13973
  • [2] Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos
    Chen, Junwen
    Bao, Wentao
    Kong, Yu
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3789 - 3797
  • [3] Spatio-temporal Attention Network for Video Instance Segmentation
    Liu, Xiaoyu
    Ren, Haibing
    Ye, Tingmeng
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 725 - 727
  • [4] Weakly supervised activity analysis with spatio-temporal localisation
    Gu, Feng
    Sridhar, Muralikrishna
    Cohn, Anthony
    Hogg, David
    Florez-Revuelta, Francisco
    Monekosso, Dorothy
    Remagnino, Paolo
    NEUROCOMPUTING, 2016, 216 : 778 - 789
  • [5] Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation
    Zeng, Chengxi
    Yang, Xinyu
    Smithard, David
    Mirmehdi, Majid
    Gambaruto, Alberto M.
    Burghardt, Tilo
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2470 - 2474
  • [6] Object Instance Search in Videos via Spatio-Temporal Trajectory Discovery
    Meng, Jingjing
    Yuan, Junsong
    Yang, Jiong
    Wang, Gang
    Tan, Yap-Peng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2016, 18 (01) : 116 - 127
  • [7] Spatio-Temporal Deformable DETR for Weakly Supervised Defect Localization
    Kim, Young-Min
    Yoo, Yong-Ho
    Yoon, In-Ug
    Myung, Hyun
    Kim, Jong-Hwan
    IEEE SENSORS JOURNAL, 2023, 23 (17) : 19935 - 19945
  • [8] A spatio-temporal network for video semantic segmentation in surgical videos
    Maria Grammatikopoulou
    Ricardo Sanchez-Matilla
    Felix Bragman
    David Owen
    Lucy Culshaw
    Karen Kerr
    Danail Stoyanov
    Imanol Luengo
    International Journal of Computer Assisted Radiology and Surgery, 2024, 19 : 375 - 382
  • [9] A spatio-temporal network for video semantic segmentation in surgical videos
    Grammatikopoulou, Maria
    Sanchez-Matilla, Ricardo
    Bragman, Felix
    Owen, David
    Culshaw, Lucy
    Kerr, Karen
    Stoyanov, Danail
    Luengo, Imanol
    INTERNATIONAL JOURNAL OF COMPUTER ASSISTED RADIOLOGY AND SURGERY, 2023, 19 (2) : 375 - 382
  • [10] A spatio-temporal network for video semantic segmentation in surgical videos
    Grammatikopoulou, Maria
    Sanchez-Matilla, Ricardo
    Bragman, Felix
    Owen, David
    Culshaw, Lucy
    Kerr, Karen
    Stoyanov, Danail
    Luengo, Imanol
    arXiv, 2023,