Multi-Granularity Context Network for Efficient Video Semantic Segmentation

Cited by: 3
Authors
Liang, Zhiyuan [1 ]
Dai, Xiangdong [2 ]
Wu, Yiqian [3 ]
Jin, Xiaogang [3 ]
Shen, Jianbing [4 ]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci, Beijing Lab Intelligent Informat Technol, Beijing 100081, Peoples R China
[2] Guangdong OPPO Mobile Telecommun Corp Ltd, Guangdong 523860, Peoples R China
[3] Zhejiang Univ, State Key Lab CAD & CG, Hangzhou 310058, Peoples R China
[4] Univ Macau, Dept Comp & Informat Sci, State Key Lab Internet Things Smart City, Macau, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Semantics; Semantic segmentation; Prototypes; Aggregates; Feature extraction; Training; Task analysis; Video semantic segmentation; light-weight networks; non-local operation
DOI
10.1109/TIP.2023.3269982
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Video semantic segmentation faces two main challenges: how to take full advantage of multi-frame context information, and how to keep computation efficient. To tackle both challenges simultaneously, we present a novel Multi-Granularity Context Network (MGCNet) that aggregates context information at multiple granularities effectively and efficiently. Our method first converts image features into semantic prototypes, and then conducts a non-local operation to aggregate the per-frame and short-term contexts jointly. An additional long-term context module captures video-level semantic information during training. By aggregating both local and global semantic information, the network obtains a strong feature representation. The proposed pixel-to-prototype non-local operation requires less computation than traditional pixel-to-pixel non-local operations, and is video-friendly because it reuses the semantic prototypes of previous frames. Moreover, we propose an uncertainty-aware and structural knowledge distillation strategy to further boost performance. Experiments on the Cityscapes and CamVid datasets with multiple backbones demonstrate that MGCNet outperforms other state-of-the-art methods with high speed and low latency.
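The pixel-to-prototype operation the abstract describes lends itself to a short illustration. Below is a minimal PyTorch sketch under stated assumptions: the 1x1-convolution soft assignment, the prototype count num_prototypes, and the prototype-caching interface are hypothetical choices made for this sketch, not details confirmed by the paper.

```python
# Illustrative sketch of a pixel-to-prototype non-local block in the
# spirit of the abstract. Module names, the assignment scheme, and the
# caching interface are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class PixelToPrototypeNonLocal(nn.Module):
    def __init__(self, channels: int, num_prototypes: int = 32):
        super().__init__()
        # 1x1 conv that softly assigns every pixel to K prototypes.
        self.assign = nn.Conv2d(channels, num_prototypes, kernel_size=1)
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def extract_prototypes(self, feat: torch.Tensor) -> torch.Tensor:
        """Summarize a frame (B, C, H, W) into K prototypes (B, K, C)."""
        # (B, K, H*W): softmax over pixels, so each prototype is a
        # weighted average of the pixel features assigned to it.
        weights = self.assign(feat).flatten(2).softmax(dim=-1)
        return torch.bmm(weights, feat.flatten(2).transpose(1, 2))

    def forward(self, feat: torch.Tensor, prev_protos=None):
        b, c, h, w = feat.shape
        cur_protos = self.extract_prototypes(feat)        # (B, K, C)
        protos = cur_protos
        if prev_protos is not None:
            # Short-term context: reuse cached prototypes of previous
            # frames instead of re-attending over their dense pixels.
            protos = torch.cat([cur_protos, prev_protos], dim=1)
        q = self.query(feat).flatten(2).transpose(1, 2)   # (B, H*W, C)
        # Attention over K (or 2K) prototypes: O(H*W*K) cost, versus
        # O((H*W)^2) for a classic pixel-to-pixel non-local operation.
        attn = torch.softmax(q @ protos.transpose(1, 2) * self.scale, dim=-1)
        context = (attn @ protos).transpose(1, 2).view(b, c, h, w)
        return feat + context, cur_protos.detach()


# Usage across a clip: carry the previous frame's prototypes forward.
block = PixelToPrototypeNonLocal(channels=128, num_prototypes=32)
frames = torch.randn(4, 2, 128, 64, 128)                  # (T, B, C, H, W)
cached = None
for t in range(frames.shape[0]):
    out, cached = block(frames[t], cached)                # (2, 128, 64, 128)
```

The efficiency claim falls out of the shapes: attention is computed against K prototype vectors rather than H*W pixels, and previous frames contribute only their cached (B, K, C) prototypes, so multi-frame context adds almost no extra cost.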
Pages: 3163-3175
Number of pages: 13