SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation

Cited by: 148
Authors
Liu, Dongfang [1 ]
Cui, Yiming [2 ]
Tan, Wenbo [3 ]
Chen, Yingjie [1 ]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
[2] Univ Florida, Gainesville, FL 32611 USA
[3] Hangzhou Dian Zi Univ, Hangzhou, Zhejiang, Peoples R China
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
Keywords
DOI
10.1109/CVPR46437.2021.00969
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video instance segmentation (VIS) is a new and critical task in computer vision. To date, top-performing VIS methods extend the two-stage Mask R-CNN by adding a tracking branch, leaving plenty of room for improvement. In contrast, we approach the VIS task from a new perspective and propose a one-stage spatial granularity network (SG-Net). Compared to conventional two-stage methods, SG-Net demonstrates four advantages: 1) Our method has a compact one-stage architecture, and each task head (detection, segmentation, and tracking) is crafted interdependently, so the heads effectively share features and benefit from joint optimization; 2) Our mask prediction is performed dynamically on the sub-regions of each detected instance, leading to high-quality masks of fine granularity; 3) Each of our task predictions avoids expensive proposal-based RoI features, greatly reducing the runtime complexity per instance; 4) Our tracking head models objects' centerness movements for tracking, which effectively enhances tracking robustness to different object appearances. In evaluation, we present state-of-the-art comparisons on the YouTube-VIS dataset. Extensive experiments demonstrate that our compact one-stage method achieves improved performance in both accuracy and inference speed. We hope SG-Net can serve as a strong and flexible baseline for the VIS task. Our code will be available here(1).
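The record only summarizes the design at a high level. As a rough illustrative sketch, not the authors' released code, the snippet below shows how a one-stage setup with a shared feature tower feeding jointly trained detection, segmentation, and tracking outputs, plus a simple center-movement association step, could be organized in PyTorch; all class names, channel widths, and the greedy matching rule are assumptions made for illustration.

# Illustrative sketch only -- not the SG-Net implementation.
# All names, channel widths, and the matching rule are assumptions.
import torch
import torch.nn as nn

class OneStageVISHead(nn.Module):
    """One shared convolutional tower whose features feed sibling
    detection, segmentation, and tracking outputs (trained jointly)."""
    def __init__(self, in_channels=256, num_classes=40):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.cls = nn.Conv2d(256, num_classes, 3, padding=1)   # detection: class logits
        self.box = nn.Conv2d(256, 4, 3, padding=1)             # detection: box regression
        self.mask = nn.Conv2d(256, 1, 3, padding=1)            # segmentation: mask logits
        self.center_move = nn.Conv2d(256, 2, 3, padding=1)     # tracking: per-location (dx, dy)

    def forward(self, feat):
        x = self.shared(feat)                                   # features shared by all heads
        return self.cls(x), self.box(x), self.mask(x), self.center_move(x)

def match_by_center_movement(prev_centers, curr_centers, predicted_moves):
    """Toy association: shift last-frame centers by the predicted movement,
    then greedily match each to the nearest current-frame center."""
    shifted = prev_centers + predicted_moves        # (N, 2)
    dists = torch.cdist(shifted, curr_centers)      # (N, M) pairwise distances
    return dists.argmin(dim=1)                      # matched current-frame index per track

# Minimal smoke test on random tensors
head = OneStageVISHead()
cls_logits, boxes, mask_logits, moves = head(torch.randn(1, 256, 64, 64))
ids = match_by_center_movement(torch.rand(3, 2), torch.rand(5, 2), 0.1 * torch.rand(3, 2))

The point of the sketch is structural: a single set of convolutional features serves every task output, and tracking reduces to matching centers shifted by predicted movements, rather than extracting per-proposal RoI features for each instance.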
Pages: 9811-9820
Page count: 10