Hierarchical Spatiotemporal Transformers for Video Object Segmentation

Cited by: 2
Authors
Yoo, Jun-Sang [1]
Lee, Hongjae [1]
Jung, Seung-Won [1]
Affiliations
[1] Korea Univ, Seoul, South Korea
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW | 2023
Funding
National Research Foundation, Singapore;
Keywords
DOI
10.1109/ICCVW60793.2023.00087
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a novel framework called HST for semi-supervised video object segmentation (VOS). HST extracts image and video features using the latest Swin Transformer and Video Swin Transformer to inherit their inductive bias for the spatiotemporal locality, which is essential for temporally coherent VOS. To take full advantage of the image and video features, HST casts image and video features as a query and memory, respectively. By applying efficient memory read operations at multiple scales, HST produces hierarchical features for the precise reconstruction of object masks. HST shows effectiveness and robustness in handling challenging scenarios with occluded and fast-moving objects under cluttered backgrounds. In particular, HST-B outperforms the state-of-the-art competitors on multiple popular benchmarks, i.e., YouTube-VOS (85.0%), DAVIS 2017 (85.9%), and DAVIS 2016 (94.0%).
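The abstract describes image features acting as a query that reads from video features stored as memory, repeated at several scales. The record does not give the exact read operation, so the following is only a minimal sketch that assumes a generic scaled dot-product attention read; it is not the authors' implementation. The names `memory_read` and `hierarchical_read`, and the representation of features as plain lists of vectors, are hypothetical and chosen for illustration.

```python
import math


def memory_read(query, mem_keys, mem_values):
    """Assumed read: each query vector retrieves a softmax-weighted
    average of the memory values, weighted by query-key similarity."""
    dim = len(mem_keys[0])
    value_dim = len(mem_values[0])
    output = []
    for q in query:
        # Similarity between this query location and every memory location.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in mem_keys
        ]
        # Numerically stable softmax over the memory locations.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the memory values.
        read = [
            sum(w * v[j] for w, v in zip(weights, mem_values))
            for j in range(value_dim)
        ]
        output.append(read)
    return output


def hierarchical_read(queries_by_scale, memories_by_scale):
    """Apply the same read independently at each scale.

    queries_by_scale: list of query feature lists, one per scale.
    memories_by_scale: list of (keys, values) pairs, one per scale.
    """
    return [
        memory_read(q, keys, values)
        for q, (keys, values) in zip(queries_by_scale, memories_by_scale)
    ]
```

In this sketch a query that closely matches one memory key returns almost exactly that key's value, while a query that matches all keys equally returns the plain average of the values. A decoder would then combine the per-scale outputs to reconstruct the mask; that step is omitted here because the record does not describe it.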
Pages: 795-805
Page count: 11