Weakly-Supervised RGBD Video Object Segmentation

Cited by: 0
Authors
Yang, Jinyu [1 ,2 ]
Gao, Mingqi [1 ,3 ]
Zheng, Feng [4 ]
Zhen, Xiantong [5 ]
Ji, Rongrong [6 ]
Shao, Ling [7 ]
Leonardis, Ales [8 ]
Affiliations
[1] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen 518055, Peoples R China
[2] Univ Birmingham, Birmingham B15 2TT, England
[3] Univ Warwick, Coventry CV4 7AL, England
[4] Southern Univ Sci & Technol, Shenzhen 518055, Peoples R China
[5] Guangdong Univ Petrochem Technol, Coll Comp Sci, Maoming 525011, Peoples R China
[6] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen 361005, Peoples R China
[7] Univ Chinese Acad Sci, UCAS Terminus AI Lab, Beijing 101408, Peoples R China
[8] Univ Birmingham, Sch Comp Sci, Birmingham B15 2TT, England
Funding
National Natural Science Foundation of China
Keywords
Annotations; Object segmentation; Training; Target tracking; Task analysis; Object tracking; Benchmark testing; RGBD data; video object segmentation; visual tracking
DOI
10.1109/TIP.2024.3374130
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Depth information opens up new opportunities for video object segmentation (VOS) to be more accurate and robust in complex scenes. However, the RGBD VOS task is largely unexplored due to the expensive collection of RGBD data and the time-consuming annotation of segmentation masks. In this work, we first introduce a new benchmark for RGBD VOS, named DepthVOS, which contains 350 videos (over 55k frames in total) annotated with masks and bounding boxes. We further propose a novel, strong baseline model, the Fused Color-Depth Network (FusedCDNet), which can be trained solely under the supervision of bounding boxes, yet at inference generates masks given only a bounding box in the first frame. The model thereby possesses three major advantages: a weakly-supervised training strategy to overcome the high annotation cost, a cross-modal fusion module to handle complex scenes, and weakly-supervised inference to promote ease of use. Extensive experiments demonstrate that our proposed method performs on par with top fully-supervised algorithms. We will open-source our project at https://github.com/yjybuaa/depthvos/ to facilitate the development of RGBD VOS.
Pages: 2158-2170 (13 pages)
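
The abstract describes FusedCDNet only at a high level, so below is a minimal PyTorch sketch of the kind of cross-modal RGB-depth fusion module it mentions. The channel-attention design, the CrossModalFusion name, and all shapes are assumptions made for illustration; the paper's actual architecture is not specified in this record.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        # Hypothetical RGB-depth fusion block (assumed design, not the
        # paper's): concatenate the two modality features, re-weight
        # channels with a squeeze-and-excitation style gate, then project
        # back to C channels.
        def __init__(self, channels: int):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                    # (B, 2C, 1, 1)
                nn.Conv2d(2 * channels, channels // 4, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // 4, 2 * channels, 1),
                nn.Sigmoid(),                               # per-channel weights
            )
            self.project = nn.Conv2d(2 * channels, channels, 1)

        def forward(self, rgb_feat, depth_feat):
            x = torch.cat([rgb_feat, depth_feat], dim=1)    # (B, 2C, H, W)
            return self.project(x * self.gate(x))           # (B, C, H, W)

    if __name__ == "__main__":
        fusion = CrossModalFusion(channels=256)
        rgb = torch.randn(1, 256, 30, 54)    # backbone features of an RGB frame
        depth = torch.randn(1, 256, 30, 54)  # features of the aligned depth map
        print(fusion(rgb, depth).shape)      # torch.Size([1, 256, 30, 54])

Gating on the concatenated features lets such a block suppress whichever modality is unreliable in a given scene (for example, noisy depth at object boundaries), which is one plausible way to realize the "cross-modal fusion module to handle complex scenes" advantage the abstract claims.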
相关论文
共 62 条
  • [1] Bhat Goutam, 2020, COMPUTER VISION ECCV, DOI 10.1007/978-3-030-58536-5\\ 46
  • [2] One-Shot Video Object Segmentation
    Caelles, S.
    Maninis, K. -K.
    Pont-Tuset, J.
    Leal-Taixe, L.
    Cremers, D.
    Van Gool, L.
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 5320 - 5329
  • [3] A Benchmarking Framework for Background Subtraction in RGBD Videos
    Camplani, Massimo
    Maddalena, Lucia
    Alcover, Gabriel Moya
    Petrosino, Alfredo
    Salgado, Luis
    [J]. NEW TRENDS IN IMAGE ANALYSIS AND PROCESSING - ICIAP 2017, 2017, 10590 : 219 - 229
  • [4] Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation
    Chen, Lin-Zhuo
    Lin, Zheng
    Wang, Ziqin
    Yang, Yong-Liang
    Cheng, Ming-Ming
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 2313 - 2324
  • [5] Cheng HK, 2021, ADV NEUR IN, V34
  • [6] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
    Cheng, Ho Kei
    Schwing, Alexander G.
    [J]. COMPUTER VISION - ECCV 2022, PT XXVIII, 2022, 13688 : 640 - 658
  • [7] Tackling Background Distraction in Video Object Segmentation
    Cho, Suhwan
    Lee, Heansung
    Lee, Minhyeok
    Park, Chaewon
    Jang, Sungjun
    Kim, Minjung
    Lee, Sangyoun
    [J]. COMPUTER VISION, ECCV 2022, PT XXII, 2022, 13682 : 446 - 462
  • [8] Cho SH, 2022, Arxiv, DOI arXiv:2209.03139
  • [9] Comport A. I., 2004, 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), P692
  • [10] BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation
    Dai, Jifeng
    He, Kaiming
    Sun, Jian
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 1635 - 1643