Pyramid-structured multi-scale transformer for efficient semi-supervised video object segmentation with adaptive fusion

Cited by: 0
Authors
Zhang, Yunzuo [1]
Yu, Puze [1]
Xiao, Yaoge [1]
Wang, Shuangshuang [1]
Affiliations
[1] Shijiazhuang Tiedao University, School of Information Science and Technology, Shijiazhuang 050043, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Video object segmentation; Real-time video segmentation; Pyramid structure
DOI
10.1016/j.patrec.2025.04.027
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, Transformer-based methods have demonstrated promising performance in semi-supervised video object segmentation (VOS). However, these methods must maintain memory frames in a memory bank, which leads to an exponential increase in GPU memory requirements as the video length increases and necessitates updating the memory bank every few frames. We propose a novel approach based on a multi-scale pyramid structure for object association with transformers. It effectively encodes both global and local features at different levels of granularity while significantly reducing GPU memory requirements as video length increases, thus maintaining high inference speed. To effectively integrate multi-scale ID embeddings with video frame embeddings, we design an adaptive fusion module rather than simply overlaying them on the original features through addition. We conducted extensive experiments on four commonly used VOS benchmarks (YouTube-VOS 2018 and 2019 Val, DAVIS-2017, and LVOS), evaluating various variants of AOT. Our method outperformed state-of-the-art competitors and consistently demonstrated superior efficiency and scalability across all four benchmarks.
Pages: 48-54
Number of pages: 7