Hierarchical Spatiotemporal Transformers for Video Object Segmentation

Cited by: 2

Authors:

Yoo, Jun-Sang [1]

Lee, Hongjae [1]

Jung, Seung-Won [1]

Affiliations:

[1] Korea Univ, Seoul, South Korea

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW | 2023

Funding:

National Research Foundation, Singapore;

Keywords:

DOI:

10.1109/ICCVW60793.2023.00087

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper presents a novel framework called HST for semi-supervised video object segmentation (VOS). HST extracts image and video features using the latest Swin Transformer and Video Swin Transformer to inherit their inductive bias for the spatiotemporal locality, which is essential for temporally coherent VOS. To take full advantage of the image and video features, HST casts image and video features as a query and memory, respectively. By applying efficient memory read operations at multiple scales, HST produces hierarchical features for the precise reconstruction of object masks. HST shows effectiveness and robustness in handling challenging scenarios with occluded and fast-moving objects under cluttered backgrounds. In particular, HST-B outperforms the state-of-the-art competitors on multiple popular benchmarks, i.e., YouTube-VOS (85.0%), DAVIS 2017 (85.9%), and DAVIS 2016 (94.0%).
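The abstract describes treating image features as a query and video features as a memory, with memory reads applied at several scales. The following is a minimal sketch of that general idea using plain scaled dot-product attention. It is not the authors' implementation: the function names `memory_read` and `hierarchical_read`, the list-of-lists feature layout, and the choice of standard softmax attention as the read operation are all assumptions made for illustration, and the record does not say how HST makes its reads efficient.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(queries, mem_keys, mem_values):
    """Generic attention-style memory read (illustrative assumption).

    queries:    list of query vectors (stand-in for image features)
    mem_keys:   list of key vectors (stand-in for video features)
    mem_values: list of value vectors, one per memory key
    Returns one read-out vector per query.
    """
    dim = len(queries[0])
    value_dim = len(mem_values[0])
    outputs = []
    for q in queries:
        # Similarity between this query and every memory slot.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in mem_keys
        ]
        weights = softmax(scores)
        # Weighted sum of memory values.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, mem_values))
            for j in range(value_dim)
        ])
    return outputs


def hierarchical_read(scales):
    """Apply memory_read independently at each scale.

    scales: list of (queries, mem_keys, mem_values) tuples, one per scale
            (hypothetical layout; a real model would take these from
            different backbone stages).
    """
    return [memory_read(q, k, v) for (q, k, v) in scales]
```

Each entry of the list returned by `hierarchical_read` holds the read-out for one scale; a decoder could then combine them from coarse to fine to reconstruct a mask, which is one plausible reading of the "hierarchical features" mentioned above.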


Pages: 795-805

Page count: 11
