Video Object Segmentation Using Multi-Scale Attention-Based Siamese Network

Cited by: 1
Authors
Zhu, Zhiliang [1]
Qiu, Leiningxin [1]
Wang, Jiaxin [2]
Xiong, Jinquan [3]
Peng, Hua [2]
Affiliations
[1] East China Jiaotong Univ, Sch Software, Nanchang 330013, Peoples R China
[2] Chinese Acad Sci, Inst Software, State Key Lab Comp Sci, Beijing 100190, Peoples R China
[3] Nanchang Normal Univ, Dept Math & Comp Sci, Nanchang 330032, Peoples R China
Keywords
video object segmentation; object detection; deep learning; Siamese neural network; attention mechanism
DOI
10.3390/electronics12132890
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Video target segmentation is a fundamental problem in computer vision that aims to segment targets from a background by learning their appearance information and movement information. In this study, a video target segmentation network based on the Siamese structure was proposed. This network has two inputs: the current video frame, used as the main input, and the adjacent frame, used as the auxiliary input. The processing modules for the inputs use the same structure, optimization strategy, and encoder weights. The input is encoded to obtain features with different resolutions, from which good target appearance features can be obtained. After processing using the encoding layer, the motion features of the target are learned using a multi-scale feature fusion decoder based on an attention mechanism. The final predicted segmentation results were calculated from a layer of decoded features. The video object segmentation framework proposed in this study achieved optimal results on CDNet2014 and FBMS-3D, with scores of 78.36 and 86.71, respectively. It outperformed the second-ranked method by 4.3 on the CDNet2014 dataset and by 0.77 on the FBMS-3D dataset. Suboptimal results were achieved on the video primary target segmentation datasets SegTrackV2 and DAVIS2016, with scores of 60.57 and 81.08, respectively.
Pages: 14