Multi-Granularity Context Network for Efficient Video Semantic Segmentation

Cited by: 3
Authors
Liang, Zhiyuan [1 ]
Dai, Xiangdong [2 ]
Wu, Yiqian [3 ]
Jin, Xiaogang [3 ]
Shen, Jianbing [4 ]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci, Beijing Lab Intelligent Informat Technol, Beijing 100081, Peoples R China
[2] Guangdong OPPO Mobile Telecommun Corp Ltd, Guangdong 523860, Peoples R China
[3] Zhejiang Univ, State Key Lab CAD & CG, Hangzhou 310058, Peoples R China
[4] Univ Macau, Dept Comp & Informat Sci, State Key Lab Internet Things Smart City, Macau, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Semantics; Semantic segmentation; Prototypes; Aggregates; Feature extraction; Training; Task analysis; Video semantic segmentation; light-weight networks; non-local operation
DOI
10.1109/TIP.2023.3269982
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Video semantic segmentation faces two main challenges: how to take full advantage of multi-frame context information, and how to keep computation efficient. To tackle both challenges simultaneously, we present a novel Multi-Granularity Context Network (MGCNet) that aggregates context information at multiple granularities effectively and efficiently. Our method first converts image features into semantic prototypes, and then conducts a non-local operation to aggregate the per-frame and short-term contexts jointly. An additional long-term context module captures video-level semantic information during training. By aggregating both local and global semantic information, the network obtains a strong feature representation. The proposed pixel-to-prototype non-local operation requires less computation than traditional pixel-to-pixel non-local operations, and is video-friendly because it reuses the semantic prototypes of previous frames. Moreover, we propose an uncertainty-aware and structural knowledge distillation strategy to further boost performance. Experiments on the Cityscapes and CamVid datasets with multiple backbones demonstrate that MGCNet outperforms other state-of-the-art methods with high speed and low latency.
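The pixel-to-prototype operation the abstract describes lends itself to a short illustration. Below is a minimal PyTorch sketch under stated assumptions: the 1x1-convolution soft assignment, the prototype count num_prototypes, and the prototype-caching interface are hypothetical choices made for this sketch, not details confirmed by the paper.

```python
# Illustrative sketch of a pixel-to-prototype non-local block in the
# spirit of the abstract. Module names, the assignment scheme, and the
# caching interface are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class PixelToPrototypeNonLocal(nn.Module):
    def __init__(self, channels: int, num_prototypes: int = 32):
        super().__init__()
        # 1x1 conv that softly assigns every pixel to K prototypes.
        self.assign = nn.Conv2d(channels, num_prototypes, kernel_size=1)
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def extract_prototypes(self, feat: torch.Tensor) -> torch.Tensor:
        """Summarize a frame (B, C, H, W) into K prototypes (B, K, C)."""
        # (B, K, H*W): softmax over pixels, so each prototype is a
        # weighted average of the pixel features assigned to it.
        weights = self.assign(feat).flatten(2).softmax(dim=-1)
        return torch.bmm(weights, feat.flatten(2).transpose(1, 2))

    def forward(self, feat: torch.Tensor, prev_protos=None):
        b, c, h, w = feat.shape
        cur_protos = self.extract_prototypes(feat)        # (B, K, C)
        protos = cur_protos
        if prev_protos is not None:
            # Short-term context: reuse cached prototypes of previous
            # frames instead of re-attending over their dense pixels.
            protos = torch.cat([cur_protos, prev_protos], dim=1)
        q = self.query(feat).flatten(2).transpose(1, 2)   # (B, H*W, C)
        # Attention over K (or 2K) prototypes: O(H*W*K) cost, versus
        # O((H*W)^2) for a classic pixel-to-pixel non-local operation.
        attn = torch.softmax(q @ protos.transpose(1, 2) * self.scale, dim=-1)
        context = (attn @ protos).transpose(1, 2).view(b, c, h, w)
        return feat + context, cur_protos.detach()


# Usage across a clip: carry the previous frame's prototypes forward.
block = PixelToPrototypeNonLocal(channels=128, num_prototypes=32)
frames = torch.randn(4, 2, 128, 64, 128)                  # (T, B, C, H, W)
cached = None
for t in range(frames.shape[0]):
    out, cached = block(frames[t], cached)                # (2, 128, 64, 128)
```

The efficiency claim falls out of the shapes: attention is computed against K prototype vectors rather than H*W pixels, and previous frames contribute only their cached (B, K, C) prototypes, so multi-frame context adds almost no extra cost.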
Pages: 3163-3175
Number of pages: 13