Rethinking Feature Reconstruction via Category Prototype in Semantic Segmentation

Cited by: 1
Authors
Tang, Quan [1 ]
Liu, Chuanjian [2 ]
Liu, Fagui [3 ]
Jiang, Jun [1 ]
Zhang, Bowen [4 ]
Chen, C. L. Philip [3 ]
Han, Kai [2 ]
Wang, Yunhe [2 ]
Affiliations
[1] Peng Cheng Lab, Dept New Network, Shenzhen 518000, Peoples R China
[2] Huawei Noahs Ark Lab, Beijing 100084, Peoples R China
[3] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou 510006, Peoples R China
[4] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5000, Australia
Keywords
Image reconstruction; Prototypes; Semantic segmentation; Transformers; Memory modules; Kernel; Convolution; Semantics; Feature extraction; Decoding; Feature reconstruction; category prototype; pyramidal features; semantic segmentation;
DOI
10.1109/TIP.2025.3534532
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The encoder-decoder architecture is a prevailing paradigm for semantic segmentation, and aggregating multi-stage encoder features has proven critical for capturing discriminative pixel representations. In this work, we rethink feature reconstruction for scale alignment of multi-stage pyramidal features and cast it as a Query Update (Q-UP) task. Pixel-wise affinity scores are computed between a high-resolution query map and a low-resolution feature map to dynamically broadcast low-resolution pixel features to the higher resolution. Unlike prior methods (e.g., bilinear interpolation) that exploit only sub-pixel neighborhoods, Q-UP samples contextual information within a global receptive field in a data-dependent manner. To alleviate intra-category feature variance, we replace the source pixel features used for reconstruction with their corresponding category prototypes, each estimated by averaging all pixel features belonging to that category. In addition, a memory module is proposed to exploit the capacity of category prototypes at the dataset level. We refer to the method as the Category Prototype Transformer (CPT). Extensive experiments on popular benchmarks show that integrating CPT into a feature pyramid structure yields superior segmentation performance even with low-resolution feature maps, e.g., 1/32 of the input size, significantly reducing computational complexity. Specifically, the proposed method obtains a compelling 55.5% mIoU on the challenging ADE20K dataset with greatly reduced model parameters and computation.
Pages: 1036-1047 (12 pages)