A Fast Video Object Segmentation Method Based on Inductive Learning and Transductive Reasoning

Cited by: 0
Authors
Xu K. [1 ]
Li G.-R. [1 ]
Hong D.-X. [1 ]
Zhang W.-G. [2 ]
Qi Y.-K. [2 ]
Huang Q.-M. [1 ]
Affiliations
[1] School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing
[2] School of Computer Science and Technology, Harbin Institute of Technology, Weihai
Source
Jisuanji Xuebao/Chinese Journal of Computers | 2022 / Vol. 45 / No. 10
Funding
National Natural Science Foundation of China;
Keywords
Attention mechanism; Online learning; Self-supervised learning; Video object segmentation; Video prediction;
DOI
10.11897/SP.J.1016.2022.02117
Abstract
With the development of multimedia and Internet technology, people can now easily shoot large numbers of videos and photos with camera devices and upload them to the Internet. Computer vision is an important research field whose goal is to enable computers to understand video content as humans do. Video Object Segmentation (VOS) is one of the fundamental tasks in computer vision: given a video sequence, the algorithm must automatically produce a pixel-level mask for the object of interest. Video object segmentation has many important applications, including autonomous driving, video surveillance, video editing, and video understanding. It is a highly challenging problem because of appearance changes, scale changes, object occlusion, non-rigid shape deformation, background interference, and so on. Existing methods can be divided into two categories according to how they use the ground-truth label of the first frame of a given video: methods based on online inductive learning and methods based on transductive reasoning. Previous methods based on online inductive learning fine-tune the whole network on the given initial-frame segmentation mask at inference time to obtain accurate results, which is time-consuming and makes it difficult to meet real-time requirements. In addition, previous methods based on transductive reasoning need large amounts of synthetic or annotated data to model temporal reasoning rules in video, which increases the cost of training. To combine the advantages of online inductive learning and transductive reasoning while avoiding the drawbacks of both, this paper proposes a fast video object segmentation method that integrates the two.
Specifically, the transductive reasoning branch is pre-trained with a self-supervised video-prediction task. From the images and segmentation results of a few preceding frames, it models the short-term temporal transformation and motion information of the video and infers the segmentation result of the current frame; the learned temporal features guide the network and improve the stability of video segmentation. Pre-training this branch requires only raw video data, without any additional synthetic data or manual annotation. The online inductive learning branch is trained online on the reference frame to learn appearance-discriminative features of the target and to provide long-term appearance discrimination. To improve inference speed, and unlike previous methods, the proposed inductive learning branch does not fine-tune the whole network online on the first frame; instead, it updates only a very lightweight template network online, which produces a rough object segmentation mask. This rough mask serves as the attention map of an attention mechanism, and the final segmentation result is generated by the decoding module from the attention map together with the temporal features and the image features of the current frame. Experiments on three challenging datasets (DAVIS-2016, DAVIS-2017, and YouTube-VOS) show that the proposed method achieves performance competitive with the state of the art. © 2022, Science Press. All rights reserved.
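The fusion step described above — using the template network's rough mask as an attention map over the current frame's features before decoding — can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name, feature shapes, and the choice of sigmoid gating plus channel-wise concatenation are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_with_attention(coarse_mask, temporal_feat, frame_feat):
    """Hypothetical fusion of the rough mask with feature maps.

    coarse_mask:   (H, W) logits from the lightweight template network
    temporal_feat: (C, H, W) features from the transductive reasoning branch
    frame_feat:    (C, H, W) image features of the current frame
    Returns a (2C, H, W) tensor the decoder would consume.
    """
    attn = sigmoid(coarse_mask)                 # attention map in [0, 1]
    attended = frame_feat * attn                # broadcast gate over channels
    return np.concatenate([attended, temporal_feat], axis=0)

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
fused = fuse_with_attention(rng.normal(size=(H, W)),
                            rng.normal(size=(C, H, W)),
                            rng.normal(size=(C, H, W)))
print(fused.shape)  # (16, 4, 4)
```

The rough mask only modulates where the decoder looks; the decoder still sees the full temporal features, which matches the abstract's claim that the final result combines all three signals.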
Pages: 2117-2132 (15 pages)