A Fast Video Object Segmentation Method Based on Inductive Learning and Transductive Reasoning

Cited by: 0
Authors
Xu K. [1 ]
Li G.-R. [1 ]
Hong D.-X. [1 ]
Zhang W.-G. [2 ]
Qi Y.-K. [2 ]
Huang Q.-M. [1 ]
Affiliations
[1] School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing
[2] School of Computer Science and Technology, Harbin Institute of Technology, Weihai
Source
Jisuanji Xuebao/Chinese Journal of Computers | 2022 / Vol. 45 / No. 10
Funding
National Natural Science Foundation of China;
Keywords
Attention mechanism; Online learning; Self-supervised learning; Video object segmentation; Video prediction;
DOI
10.11897/SP.J.1016.2022.02117
Abstract
With the development of multimedia and Internet technology, people can easily capture large numbers of videos and photos with camera equipment and upload them to the Internet. Computer vision is an important research field whose goal is to enable computers to understand the content of videos as humans do. Video Object Segmentation (VOS) is one of the fundamental tasks in computer vision: it aims to automatically produce the pixel-level mask of the object of interest in a video sequence. Video object segmentation has various important applications, including autonomous driving, video surveillance, video editing, and video understanding. It is a highly challenging problem because of appearance changes, scale changes, target occlusion, non-rigid shape changes, background interference, and so on. Existing methods can be divided into two categories according to how they use the ground-truth label of the first frame of a given video: methods based on online inductive learning and methods based on transductive reasoning. Previous methods based on online inductive learning fine-tune the whole network on the given initial-frame segmentation mask at inference time to obtain accurate results, which incurs high time consumption and makes it difficult to meet real-time requirements. In addition, previous methods based on transductive reasoning require a large amount of synthetic or annotated data to model temporal reasoning rules in videos, which increases the training cost. To exploit the advantages of both online inductive learning and transductive reasoning while avoiding their respective drawbacks, this paper proposes a fast video object segmentation method that combines the two. Specifically, the transductive reasoning branch is pre-trained with a video-prediction self-supervised learning method. It models the short-term temporal transformations and motion information of the video from the images and segmentation results of a few previous frames, and thus infers the segmentation result of the current frame; the learned temporal features guide the network to improve the stability of video segmentation. Pre-training the transductive reasoning branch requires only raw video data, without any additional synthetic data or manual annotation. The online inductive learning branch is trained online on the reference frame to learn appearance-discriminative features of the target and to provide long-term appearance discrimination. To improve inference speed, unlike previous methods, the proposed inductive learning branch does not use the first frame to fine-tune the whole network online; instead, it updates a very lightweight template network online so that the template network produces a rough object segmentation mask. This rough mask serves as the attention map of an attention mechanism, and the final segmentation result is generated by a decoding module from the attention map together with the temporal features and the image features of the current frame. Experiments on three challenging datasets (DAVIS-2016, DAVIS-2017, and YouTube-VOS) show that our method achieves competitive performance against the state of the art. © 2022, Science Press. All rights reserved.
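The abstract describes a two-branch design: a lightweight template network fine-tuned online on the first frame, whose rough mask serves as an attention map, plus a self-supervised transductive branch that extracts temporal features from a few previous frames and masks, with a decoder fusing both against current-frame features. The following is a minimal PyTorch sketch of that wiring; every module name, channel width, and the training loop below are hypothetical illustrations of the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemplateNet(nn.Module):
    """Lightweight template network (inductive branch). Only this module is
    updated online on the reference frame; it outputs a rough object mask."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, feat):
        return torch.sigmoid(self.head(feat))  # rough mask in [0, 1]

class TemporalBranch(nn.Module):
    """Transductive branch. Pre-trained with self-supervised video prediction,
    it encodes a few previous frames and their masks into temporal features."""
    def __init__(self, n_prev=2, out_ch=64):
        super().__init__()
        # 3 RGB channels + 1 mask channel per previous frame
        self.enc = nn.Conv2d(n_prev * 4, out_ch, 3, padding=1)

    def forward(self, prev_frames, prev_masks):
        # prev_frames: (B, T, 3, H, W); prev_masks: (B, T, 1, H, W)
        x = torch.cat([prev_frames, prev_masks], dim=2).flatten(1, 2)
        return F.relu(self.enc(x))

class FastVOS(nn.Module):
    """Toy two-branch segmenter: the rough mask gates current-frame features
    (attention), and a decoder fuses them with the temporal features."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 64, 3, padding=1)  # stand-in image encoder
        self.template = TemplateNet()
        self.temporal = TemporalBranch()
        self.decoder = nn.Conv2d(128, 1, 3, padding=1)  # fused features -> mask logits

    def image_feat(self, frame):
        return F.relu(self.backbone(frame))

    def forward(self, frame, prev_frames, prev_masks):
        feat = self.image_feat(frame)
        attn = self.template(feat)                     # rough mask as attention map
        temp = self.temporal(prev_frames, prev_masks)  # short-term motion cues
        fused = torch.cat([feat * attn, temp], dim=1)
        return self.decoder(fused)                     # final mask logits

model = FastVOS()
ref_frame = torch.rand(1, 3, 64, 64)
ref_mask = torch.rand(1, 1, 64, 64).round()

# Online inductive learning: fine-tune ONLY the lightweight template network
# on the annotated first frame, instead of the whole network.
opt = torch.optim.Adam(model.template.parameters(), lr=1e-3)
for _ in range(20):
    pred = model.template(model.image_feat(ref_frame).detach())
    loss = F.binary_cross_entropy(pred, ref_mask)
    opt.zero_grad(); loss.backward(); opt.step()

# Transductive inference on a later frame, conditioned on two previous frames.
logits = model(torch.rand(1, 3, 64, 64),
               torch.rand(1, 2, 3, 64, 64), torch.rand(1, 2, 1, 64, 64))
print(logits.shape)  # torch.Size([1, 1, 64, 64])
```

The point mirrored from the abstract is that at test time only `model.template` is optimized, which is what would keep such a method fast relative to fine-tuning the whole network.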
Pages: 2117-2132
Page count: 15