A Spatiotemporal Mask Autoencoder for One-shot Video Object Segmentation

Cited by: 1

Authors:

Chen, Baiyu [1]

Zhao, Li [1]

Chan, Sixian [2]

Affiliations:

[1] Wenzhou Univ, Key Lab Intelligent Informat Safety & Emergency Z, Wenzhou, Peoples R China

[2] Zhejiang Univ Technol, Coll Comp Sci & Technol, Hangzhou, Peoples R China

Source:

PROCEEDINGS OF 2024 3RD INTERNATIONAL CONFERENCE ON FRONTIERS OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, FAIML 2024 | 2024

Funding:

National Natural Science Foundation of China;

Keywords:

video object segmentation; weak supervision; autoencoder;

DOI:

10.1145/3653644.3653658

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper introduces a novel architecture for the video object segmentation (VOS) challenge to achieve greater label efficiency. Previous studies have primarily tackled this problem through either match-based or propagate-based architectures, relying on fully annotated datasets. In contrast, we propose the spatiotemporal mask autoencoder (STMAE), a novel VOS architecture constructed using annotations solely from the first frame. Specifically, STMAE generates a precise mask by initially aggregating a coarse mask from previous frames based on visual correspondence provided by an image encoder and then reconstructing it. We further propose a one-shot training strategy to learn general object representations for VOS using only the first frame mask. This strategy incorporates a reconstruction loss that guides the network to reconstruct the first frame mask from the spatiotemporal aggregation. Finally, extensive experiments conducted on the DAVIS and YouTube-VOS datasets demonstrate that STMAE achieves remarkable performance while effectively addressing the labor-intensive annotation issue.
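The coarse-mask aggregation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `aggregate_coarse_mask`, the cosine-similarity affinity, and the `temperature` value are assumptions chosen for illustration, and the learned reconstruction stage that refines the coarse mask is not modelled here.

```python
import numpy as np

def aggregate_coarse_mask(query_feats, ref_feats, ref_masks, temperature=0.07):
    """Propagate masks from previous frames to the query frame (hypothetical helper).

    query_feats: (Nq, C) encoder features of query-frame pixels
    ref_feats:   (Nr, C) encoder features of pixels from previous frames
    ref_masks:   (Nr, K) soft or one-hot masks over K classes for those pixels
    returns:     (Nq, K) coarse mask for the query frame
    """
    # L2-normalise so the dot product is a cosine similarity (an assumption).
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-8)
    r = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + 1e-8)

    # Row-wise softmax over reference pixels gives the visual correspondence.
    logits = q @ r.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    affinity = np.exp(logits)
    affinity /= affinity.sum(axis=1, keepdims=True)

    # Each query pixel receives an affinity-weighted average of reference labels.
    return affinity @ ref_masks

# Toy usage: two reference pixels with orthogonal features and one-hot labels.
ref_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
ref_masks = np.array([[1.0, 0.0], [0.0, 1.0]])
query_feats = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]])
coarse = aggregate_coarse_mask(query_feats, ref_feats, ref_masks)
print(coarse.argmax(axis=1))
```

Because each affinity row sums to one, the coarse mask stays a valid per-pixel distribution whenever the reference masks are. In the one-shot setting the abstract describes, a reconstruction loss would compare a mask rebuilt from such an aggregation against the first-frame annotation.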

Citation

Pages: 6-12

Page count: 7
