A Reinforcement Learning Based Adaptive ROI Generation for Video Object Segmentation

Cited by: 5
Authors
Usmani, Usman Ahmad [1 ]
Watada, Junzo [2 ]
Jaafar, Jafreezal [1 ]
Aziz, Izzatdin Abdul [1 ]
Roy, Arunava [3 ]
Affiliations
[1] Univ Teknol PETRONAS, Fac Sci & IT, Dept Comp & Informat Sci, Seri Iskandar 32610, Perak, Malaysia
[2] Waseda Univ, Grad Sch Informat, Prod & Syst, Kitakyushu, Fukuoka 8080135, Japan
[3] Monash Univ Malaysia, Sch Informat Technol, Dept Comp Sci, Subang Jaya 47500, Selangor, Malaysia
Keywords
Motion segmentation; Correlation; Feature extraction; Object segmentation; Computational modeling; Reinforcement learning; Training; Model adaptation; object detection; object tracking; reinforcement learning; video object segmentation; EXTRACTION; ATTENTION;
DOI
10.1109/ACCESS.2021.3132453
CLC classification number
TP [Automation Technology; Computer Technology];
Subject classification code
0812 ;
Abstract
The primary goal of video object segmentation is to automatically extract the principal foreground object(s) from the background in videos. Current deep learning-based models focus on learning discriminative foreground representations over motion and appearance within short temporal segments. The segmentation process must handle challenges such as deformation, scale variation, motion blur, and occlusion. Furthermore, if the segmentation target is lost in the current frame, relocating it in the next frame is difficult. This work addresses the zero-shot video object segmentation problem in a holistic fashion. We exploit the inherent correlations between video frames by incorporating a global co-attention mechanism to overcome these limitations. We propose a novel reinforcement learning framework that provides efficient and fast stages for gathering scene context and global correlations. The agent concurrently computes and aggregates co-attention responses in the joint feature space, and it can generate multiple co-attention variants to capture different aspects of the common feature space. Our framework is trained on pairs (or groups) of video frames, which enriches the training content and thus increases learning capacity. During the segmentation phase, our approach encodes the important information by simultaneously processing multiple reference frames, which are subsequently used to predict the persistent and conspicuous foreground objects. The proposed method has been validated on four commonly used video object segmentation datasets: SegTrack V2, DAVIS 2016, CDnet 2014, and the YouTube-Objects dataset. On DAVIS 2016, the proposed method improves on state-of-the-art techniques by 4% in F1 measure; on SegTrack V2 by 12.03% in Jaccard index; and on YouTube-Objects by 13.11% in Jaccard index.
Meanwhile, on CDnet 2014 our algorithm improves accuracy by 8%, F1 measure by 12.25%, and precision by 14%, thus ranking higher than current state-of-the-art methods.
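To make the co-attention idea in the abstract concrete, the following is a minimal NumPy sketch of computing co-attention between the features of two video frames. This is an illustrative assumption, not the authors' actual implementation: the feature shapes, the learnable affinity matrix `W`, and the row-wise softmax normalization are all choices made here for the sketch.

```python
import numpy as np

def co_attention(feat_a, feat_b, W):
    """Attend features of frame A over features of frame B.

    feat_a, feat_b: (C, N) flattened feature maps of two frames,
        where C is the channel count and N = H*W spatial positions.
    W: (C, C) affinity weight matrix (learnable in a real model).
    Returns frame-B features aggregated per frame-A position, (C, N).
    """
    # Pairwise affinity between all spatial positions: (N, N)
    S = feat_a.T @ W @ feat_b
    # Softmax over frame-B positions for each frame-A position
    S = S - S.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
    # Each frame-A position aggregates frame-B features by attention
    return feat_b @ A.T                     # (C, N)

# Toy usage with random features
rng = np.random.default_rng(0)
C, N = 8, 16
fa = rng.standard_normal((C, N))
fb = rng.standard_normal((C, N))
W = rng.standard_normal((C, C)) * 0.1
out = co_attention(fa, fb, W)
print(out.shape)
```

In the paper's framework, an agent would produce multiple such co-attention variants (e.g. different `W` projections) over pairs or groups of frames and fuse the responses in the joint feature space; the sketch above shows only a single variant.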
Pages: 161959-161977
Number of pages: 19