Dual temporal memory network with high-order spatio-temporal graph learning for video object segmentation

Times cited: 0
Authors
Fan, Jiaqing [1]
Hu, Shenglong [2]
Wang, Long [3]
Zhang, Kaihua [2]
Liu, Bo [4]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou, Peoples R China
[2] Nanjing Univ Informat Sci & Technol, Sch Comp & Sci, Nanjing, Peoples R China
[3] Guodian Nanjing Automat Co Ltd, Nanjing, Peoples R China
[4] Walmart Global Tech, Sunnyvale, CA USA
Keywords
Semi-supervised learning; Video object segmentation; Graph convolution networks; High-order graph learning; Direction-aware attention
DOI
10.1016/j.imavis.2024.105208
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video Object Segmentation (VOS) is typically evaluated in a semi-supervised setting at test time. Given only the ground-truth segmentation mask in the first frame, the goal is to track and segment one or more target objects in the subsequent frames of the sequence. A fundamental issue in VOS is how best to exploit temporal information to improve accuracy. To address this issue, we propose an end-to-end framework that simultaneously extracts long-term and short-term historical sequential information for the current frame as temporal memories. The integrated temporal architecture consists of a short-term memory module and a long-term memory module. Specifically, the short-term memory module uses a high-order graph-based learning framework to model fine-grained spatio-temporal interactions between local regions across neighboring frames, thereby maintaining the spatio-temporal visual consistency of local regions. Meanwhile, to alleviate occlusion and drift, the long-term memory module employs a Simplified Gated Recurrent Unit (S-GRU) to model long-range temporal evolution in a video. Furthermore, we design a novel direction-aware attention module that complementarily augments the object representation for more robust segmentation. Experiments on three mainstream VOS benchmarks, namely DAVIS 2017, DAVIS 2016, and YouTube-VOS, demonstrate that the proposed method achieves a reasonable trade-off between speed and accuracy.
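The abstract does not give the S-GRU equations. As a point of reference only, here is a minimal sketch of a standard GRU update used as a long-term memory that is refreshed once per frame. It is not the paper's simplified variant. The class name `GRUMemory`, the per-frame feature vectors, the random initialization, and the dimensions are all hypothetical. GRU implementations also differ in whether the update gate `z` weights the old state or the new candidate; this sketch uses "z near 0 keeps the old memory".

```python
from typing import Optional

import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Logistic function used for the GRU gates."""
    return 1.0 / (1.0 + np.exp(-x))


class GRUMemory:
    """Standard GRU cell used as a long-term memory (hypothetical sketch).

    The hidden state h plays the role of the long-term memory, and each x is an
    assumed pooled feature vector of one video frame. This is NOT the paper's
    S-GRU, whose simplification is not specified in the abstract.
    """

    def __init__(self, in_dim: int, mem_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(mem_dim)
        self.mem_dim = mem_dim
        # One input weight, recurrent weight and bias per gate:
        # z = update gate, r = reset gate, n = candidate memory.
        self.W = {g: rng.uniform(-scale, scale, (mem_dim, in_dim)) for g in "zrn"}
        self.U = {g: rng.uniform(-scale, scale, (mem_dim, mem_dim)) for g in "zrn"}
        self.b = {g: np.zeros(mem_dim) for g in "zrn"}

    def step(self, x: np.ndarray, h: np.ndarray) -> np.ndarray:
        """Fold one frame's features x into the memory h."""
        z = sigmoid(self.W["z"] @ x + self.U["z"] @ h + self.b["z"])  # how much to overwrite
        r = sigmoid(self.W["r"] @ x + self.U["r"] @ h + self.b["r"])  # how much past state to expose
        n = np.tanh(self.W["n"] @ x + self.U["n"] @ (r * h) + self.b["n"])  # candidate memory
        # Convex interpolation: z -> 0 keeps the old memory, z -> 1 adopts the candidate.
        return (1.0 - z) * h + z * n

    def run(self, frames: np.ndarray, h0: Optional[np.ndarray] = None) -> np.ndarray:
        """Process frames in temporal order and return the final memory."""
        h = np.zeros(self.mem_dim) if h0 is None else h0
        for x in frames:
            h = self.step(x, h)
        return h


if __name__ == "__main__":
    frames = np.random.default_rng(1).standard_normal((5, 8))  # 5 frames, 8-dim features
    mem = GRUMemory(in_dim=8, mem_dim=4)
    h = mem.run(frames)
    print(h.shape)  # (4,)
```

The gating is what makes a recurrent memory plausible for occlusion and drift. When the gate stays closed, the memory carries earlier appearance forward and is not overwritten by a corrupted frame. A "simplified" GRU would typically remove or share some of these gates to cut computation.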
Pages: 10