SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation

Cited by: 148
Authors
Liu, Dongfang [1 ]
Cui, Yiming [2 ]
Tan, Wenbo [3 ]
Chen, Yingjie [1 ]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
[2] Univ Florida, Gainesville, FL 32611 USA
[3] Hangzhou Dian Zi Univ, Hangzhou, Zhejiang, Peoples R China
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
Keywords
DOI
10.1109/CVPR46437.2021.00969
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video instance segmentation (VIS) is a new and critical task in computer vision. To date, top-performing VIS methods extend the two-stage Mask R-CNN by adding a tracking branch, leaving plenty of room for improvement. In contrast, we approach the VIS task from a new perspective and propose a one-stage spatial granularity network (SG-Net). Compared to conventional two-stage methods, SG-Net demonstrates four advantages: 1) Our method has a compact one-stage architecture, and each task head (detection, segmentation, and tracking) is crafted interdependently, so the heads effectively share features and benefit from joint optimization; 2) Our mask prediction is performed dynamically on the sub-regions of each detected instance, leading to high-quality masks of fine granularity; 3) Each of our task predictions avoids expensive proposal-based RoI features, greatly reducing the runtime complexity per instance; 4) Our tracking head models objects' centerness movements for tracking, which effectively enhances tracking robustness to different object appearances. In evaluation, we present state-of-the-art comparisons on the YouTube-VIS dataset. Extensive experiments demonstrate that our compact one-stage method achieves improved performance in both accuracy and inference speed. We hope SG-Net can serve as a strong and flexible baseline for the VIS task. Our code will be available here(1).
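The record only summarizes the design at a high level. As a rough illustrative sketch, not the authors' released code, the snippet below shows how a one-stage setup with a shared feature tower feeding jointly trained detection, segmentation, and tracking outputs, plus a simple center-movement association step, could be organized in PyTorch; all class names, channel widths, and the greedy matching rule are assumptions made for illustration.

# Illustrative sketch only -- not the SG-Net implementation.
# All names, channel widths, and the matching rule are assumptions.
import torch
import torch.nn as nn

class OneStageVISHead(nn.Module):
    """One shared convolutional tower whose features feed sibling
    detection, segmentation, and tracking outputs (trained jointly)."""
    def __init__(self, in_channels=256, num_classes=40):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.cls = nn.Conv2d(256, num_classes, 3, padding=1)   # detection: class logits
        self.box = nn.Conv2d(256, 4, 3, padding=1)             # detection: box regression
        self.mask = nn.Conv2d(256, 1, 3, padding=1)            # segmentation: mask logits
        self.center_move = nn.Conv2d(256, 2, 3, padding=1)     # tracking: per-location (dx, dy)

    def forward(self, feat):
        x = self.shared(feat)                                   # features shared by all heads
        return self.cls(x), self.box(x), self.mask(x), self.center_move(x)

def match_by_center_movement(prev_centers, curr_centers, predicted_moves):
    """Toy association: shift last-frame centers by the predicted movement,
    then greedily match each to the nearest current-frame center."""
    shifted = prev_centers + predicted_moves        # (N, 2)
    dists = torch.cdist(shifted, curr_centers)      # (N, M) pairwise distances
    return dists.argmin(dim=1)                      # matched current-frame index per track

# Minimal smoke test on random tensors
head = OneStageVISHead()
cls_logits, boxes, mask_logits, moves = head(torch.randn(1, 256, 64, 64))
ids = match_by_center_movement(torch.rand(3, 2), torch.rand(5, 2), 0.1 * torch.rand(3, 2))

The point of the sketch is structural: a single set of convolutional features serves every task output, and tracking reduces to matching centers shifted by predicted movements, rather than extracting per-proposal RoI features for each instance.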
Pages: 9811-9820
Page count: 10