A Spatiotemporal Mask Autoencoder for One-shot Video Object Segmentation

Cited by: 1
Authors
Chen, Baiyu [1]
Zhao, Li [1]
Chan, Sixian [2]
Affiliations
[1] Wenzhou Univ, Key Lab Intelligent Informat Safety & Emergency Z, Wenzhou, Peoples R China
[2] Zhejiang Univ Technol, Coll Comp Sci & Technol, Hangzhou, Peoples R China
Source
PROCEEDINGS OF 2024 3RD INTERNATIONAL CONFERENCE ON FRONTIERS OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, FAIML 2024 | 2024
Funding
National Natural Science Foundation of China;
Keywords
video object segmentation; weak supervision; autoencoder;
DOI
10.1145/3653644.3653658
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper introduces a novel architecture for video object segmentation (VOS) that improves label efficiency. Previous studies have primarily tackled the problem with matching-based or propagation-based architectures that rely on fully annotated datasets. In contrast, we propose the spatiotemporal mask autoencoder (STMAE), a VOS architecture trained with annotations from the first frame only. Specifically, STMAE generates a precise mask in two steps: it first aggregates a coarse mask from previous frames based on visual correspondence provided by an image encoder, and then reconstructs a refined mask from that coarse mask. We further propose a one-shot training strategy to learn general object representations for VOS using only the first-frame mask. This strategy incorporates a reconstruction loss that guides the network to reconstruct the first-frame mask from the spatiotemporal aggregation. Finally, extensive experiments on the DAVIS and YouTube-VOS datasets demonstrate that STMAE achieves strong performance while reducing the labor-intensive annotation burden.
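The aggregation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `aggregate_coarse_mask` and `reconstruction_loss`, the softmax-over-cosine-similarity affinity, the `temperature` value, and the cross-entropy form of the loss are all assumptions made for illustration. The learned decoder that refines the coarse mask is omitted.

```python
import numpy as np

def aggregate_coarse_mask(query_feat, ref_feats, ref_masks, temperature=0.07):
    """Aggregate a coarse mask for the query frame from reference frames.

    Hypothetical helper, assuming a softmax over cosine similarities.
    query_feat: (N, C) encoder features for N query-frame pixels
    ref_feats:  (M, C) encoder features for M reference pixels
                (previous frames, stacked along the first axis)
    ref_masks:  (M, K) soft or one-hot masks for K objects
    returns:    (N, K) coarse mask, one distribution per query pixel
    """
    # L2-normalise so the dot product is a cosine similarity.
    q = query_feat / (np.linalg.norm(query_feat, axis=1, keepdims=True) + 1e-8)
    r = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + 1e-8)
    logits = q @ r.T / temperature
    # Numerically stable softmax over the reference pixels.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each query pixel takes a weighted average of the reference labels.
    return weights @ ref_masks

def reconstruction_loss(pred_mask, first_frame_mask, eps=1e-8):
    """Cross-entropy between a predicted mask and the first-frame mask.

    Assumed loss form; the abstract only states that a reconstruction
    loss on the first-frame mask is used.
    """
    per_pixel = -np.sum(first_frame_mask * np.log(pred_mask + eps), axis=1)
    return float(np.mean(per_pixel))

# Toy usage: 4 pixels with orthogonal features and 2 objects.
feats = np.eye(4)
masks = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
coarse = aggregate_coarse_mask(feats, feats, masks)
loss = reconstruction_loss(coarse, masks)
```

In this toy case the query and reference features coincide, so the coarse mask closely matches the reference mask and the loss is near zero. In a one-shot setting, only the first-frame mask would be available as the target of such a loss.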
Pages: 6-12
Page count: 7