Tracking Anything with Decoupled Video Segmentation

Cited by: 26
Authors
Cheng, Ho Kei [1]
Oh, Seoung Wug [2]
Price, Brian [2]
Schwing, Alexander [1]
Lee, Joon-Young [2]
Affiliations
[1] Univ Illinois, Urbana, IL 61801 USA
[2] Adobe Res, San Francisco, CA USA
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV | 2023
DOI
10.1109/ICCV51070.2023.00127
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: hkchengrex.github.io/Tracking-Anything-with-DEVA.
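The decoupling described in the abstract can be illustrated with a toy sketch: segments propagated from earlier frames are fused with fresh image-level detections on the current frame. This is not the DEVA implementation. The mask representation (sets of pixel coordinates), the greedy IoU matching, the 0.5 threshold, and the names `iou` and `fuse` are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two masks stored as pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse(
    propagated: Dict[int, Mask],
    detections: List[Mask],
    next_id: int,
    thresh: float = 0.5,
) -> Tuple[Dict[int, Mask], int]:
    """Hypothetical fusion step (an assumption, not the paper's algorithm).

    Each image-level detection is greedily matched to the unmatched
    propagated segment with the highest IoU. A match keeps the track id
    and refreshes its mask; an unmatched detection starts a new track;
    unmatched propagated segments are carried over unchanged.
    """
    fused = dict(propagated)
    unmatched = set(propagated)
    for det in detections:
        best_id, best_iou = None, 0.0
        for track_id in sorted(unmatched):
            score = iou(propagated[track_id], det)
            if score >= thresh and score > best_iou:
                best_id, best_iou = track_id, score
        if best_id is None:
            fused[next_id] = det  # new object enters the video
            next_id += 1
        else:
            fused[best_id] = det  # existing track, refreshed mask
            unmatched.discard(best_id)
    return fused, next_id


# Toy frame: one tracked object, two image-level detections.
propagated = {1: {(0, 0), (0, 1), (1, 0), (1, 1)}}
detections = [{(0, 0), (0, 1), (1, 0)}, {(5, 5), (5, 6)}]
fused, next_id = fuse(propagated, detections, next_id=2)
```

In this sketch only the image-level detector would need to change per task, while the propagation and fusion steps stay task-agnostic, which mirrors the design argument made in the abstract.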
Pages: 1316-1326
Page count: 11