F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation

Times Cited: 0
Authors
Liu, Daizong [1 ]
Yu, Dongdong [2 ]
Wang, Changhu [2 ]
Zhou, Pan [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] ByteDance AI Lab, Wuhan, Peoples R China
Source
THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2021, Vol. 35
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Although deep learning based methods have achieved great progress in unsupervised video object segmentation, difficult scenarios (e.g., visual similarity, occlusions, and appearance changes) are still not well handled. To alleviate these issues, we propose a novel Focus on Foreground Network (F2Net), which delves into intra- and inter-frame details of the foreground objects and thus effectively improves segmentation performance. Specifically, our proposed network consists of three main parts: a Siamese Encoder Module, a Center Guiding Appearance Diffusion Module, and a Dynamic Information Fusion Module. First, a Siamese encoder extracts feature representations of paired frames (the reference frame and the current frame). Then, the Center Guiding Appearance Diffusion Module captures the inter-frame features (dense correspondences between the reference and current frames), the intra-frame features (dense correspondences within the current frame), and the original semantic features of the current frame. In particular, a Center Prediction Branch predicts the center location of the foreground object in the current frame, and this center point serves as a spatial guidance prior that enhances inter-frame and intra-frame feature extraction, so that the feature representations focus strongly on the foreground objects. Finally, the Dynamic Information Fusion Module automatically selects the most relevant of these three levels of features. Extensive experiments on the DAVIS2016, YouTube-Objects, and FBMS datasets show that our proposed F2Net achieves state-of-the-art performance with significant improvements.
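The center-guided diffusion step lends itself to a compact illustration. Below is a minimal PyTorch sketch of the idea as the abstract describes it: dense non-local attention between two feature maps, biased by a soft prior around the predicted object center. All names, shapes, and the Gaussian form of the prior (gaussian_center_prior, sigma, etc.) are illustrative assumptions, not the authors' implementation.

import torch

def gaussian_center_prior(center, height, width, sigma=0.1):
    # Soft spatial prior: a Gaussian bump around the predicted object center.
    # center: (B, 2) tensor of (x, y) in normalized [0, 1] coordinates.
    # Returns a (B, H*W) weight map emphasizing positions near the center.
    ys = torch.linspace(0, 1, height, device=center.device)
    xs = torch.linspace(0, 1, width, device=center.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")          # (H, W) each
    dist2 = (grid_x[None] - center[:, 0, None, None]) ** 2 \
          + (grid_y[None] - center[:, 1, None, None]) ** 2          # (B, H, W)
    return torch.exp(-dist2 / (2 * sigma ** 2)).flatten(1)          # (B, H*W)

def center_guided_attention(query_feat, key_feat, center, sigma=0.1):
    # Dense correspondence (non-local attention) between two feature maps,
    # biased by the center prior so matches concentrate on the foreground.
    # query_feat: current-frame features, (B, C, H, W).
    # key_feat:   reference-frame features (inter-frame branch) or the
    #             current frame itself (intra-frame branch).
    b, c, h, w = query_feat.shape
    q = query_feat.flatten(2).transpose(1, 2)                       # (B, HW, C)
    k = key_feat.flatten(2)                                         # (B, C, HW)
    affinity = torch.bmm(q, k) / c ** 0.5                           # (B, HW, HW)
    prior = gaussian_center_prior(center, h, w, sigma)              # (B, HW)
    # Additively bias every query's attention toward positions near the center.
    affinity = affinity + prior.log().clamp(min=-10.0).unsqueeze(1)
    attn = affinity.softmax(dim=-1)
    out = torch.bmm(attn, k.transpose(1, 2))                        # (B, HW, C)
    return out.transpose(1, 2).reshape(b, c, h, w)

# Toy usage: one call per branch, sharing the predicted center.
ref, cur = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
center = torch.tensor([[0.5, 0.5], [0.3, 0.7]])                     # predicted (x, y)
inter = center_guided_attention(cur, ref, center)                   # inter-frame feature
intra = center_guided_attention(cur, cur, center)                   # intra-frame feature
print(inter.shape, intra.shape)                                     # (2, 64, 16, 16) each

In F2Net the resulting inter- and intra-frame features are further combined with the current frame's semantic features by the Dynamic Information Fusion Module; the gating used there is omitted from this sketch.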
Pages: 2109-2117
Number of Pages: 9