Rethinking Local and Global Feature Representation for Dense Prediction

Times Cited: 11
Authors
Chen, Mohan [1]
Zhang, Li [1]
Feng, Rui [1]
Xue, Xiangyang [1]
Feng, Jianfeng [1]
Affiliations
[1] Fudan Univ, Shanghai, Peoples R China
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
Dense prediction; Vision transformer; Semantic segmentation; Depth estimation; Object detection;
DOI
10.1016/j.patcog.2022.109168
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Although fully convolutional networks (FCNs) have dominated dense prediction tasks (e.g., semantic segmentation, depth estimation and object detection) for decades, they are inherently limited in capturing long-range structured relationships with their layers of local kernels. While recent Transformer-based models have proven extremely successful in computer vision tasks by capturing global representation, they can deteriorate dense prediction results by over-smoothing regions containing fine details (e.g., boundaries and small objects). To this end, we aim to provide an alternative perspective by rethinking local and global feature representation for the dense prediction task. Specifically, we deploy a Dual-Stream Convolution-Transformer architecture, called DSCT, taking advantage of both the convolution and the Transformer to learn a rich feature representation, combined with a task decoder to provide a powerful dense prediction model. DSCT extracts high-resolution local feature representation from convolution layers and global feature representation from Transformer layers. With the local and global context modeled explicitly in every layer, the two streams can be combined with a decoder to perform semantic segmentation, monocular depth estimation or object detection. Extensive experiments show that DSCT achieves superior performance on the three tasks above. For semantic segmentation, DSCT sets a new state of the art on the Cityscapes validation set (83.31% mIoU) with only 80,000 training iterations and achieves appealing performance (49.27% mIoU) on the ADE20K validation set, outperforming most of the alternatives. For monocular depth estimation, our model achieves 2.423 RMSE on the KITTI Eigen split, superior to most convolution or Transformer counterparts. For object detection, without using FPN, we achieve 44.5% APb on the COCO dataset when using Faster R-CNN, which is higher than Conformer. (c) 2022 Elsevier Ltd. All rights reserved.
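The abstract describes a dual-stream design in which a convolution branch keeps high-resolution local features and a Transformer branch models global context, with the two fused and decoded into a dense output. Below is a minimal PyTorch sketch of that general idea; the module names, channel sizes, patch size, and the simple concatenation-based fusion are illustrative assumptions for exposition, not the authors' DSCT implementation.

# Illustrative sketch only: a generic dual-stream (convolution + Transformer) encoder
# with a simple fusion and a segmentation head. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class ConvStream(nn.Module):
    """Local stream: stacked convolutions keep a high-resolution feature map."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)                       # (B, dim, H/4, W/4)

class TransformerStream(nn.Module):
    """Global stream: patch embedding + Transformer encoder models long-range context."""
    def __init__(self, in_ch=3, dim=64, patch=16, depth=2, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
    def forward(self, x):
        tokens = self.patch_embed(x)              # (B, dim, H/16, W/16)
        B, C, H, W = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)   # (B, H*W, dim)
        seq = self.encoder(seq)
        return seq.transpose(1, 2).reshape(B, C, H, W)

class DualStreamSeg(nn.Module):
    """Fuse local and global features, then decode to per-pixel class logits."""
    def __init__(self, num_classes=19, dim=64):
        super().__init__()
        self.local = ConvStream(dim=dim)
        self.glob = TransformerStream(dim=dim)
        self.head = nn.Conv2d(2 * dim, num_classes, kernel_size=1)
    def forward(self, x):
        f_local = self.local(x)                   # high-resolution local features
        f_global = self.glob(x)                   # coarse global features
        f_global = nn.functional.interpolate(
            f_global, size=f_local.shape[-2:], mode="bilinear", align_corners=False)
        logits = self.head(torch.cat([f_local, f_global], dim=1))
        return nn.functional.interpolate(
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False)

# Example: a 512x512 image yields per-pixel logits for 19 Cityscapes classes.
model = DualStreamSeg(num_classes=19)
out = model(torch.randn(1, 3, 512, 512))          # -> (1, 19, 512, 512)

The same encoder pair could in principle feed a depth-regression or detection head instead of the segmentation head shown here, which is how the abstract frames the three tasks.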
Pages: 11
Cited references
66 records in total
  • [1] [Anonymous], P IEEE INT C COMPUTE
  • [2] SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
    Badrinarayanan, Vijay
    Kendall, Alex
    Cipolla, Roberto
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2017, 39 (12) : 2481 - 2495
  • [3] AdaBins: Depth Estimation Using Adaptive Bins
    Bhat, Shariq Farooq
    Alhashim, Ibraheem
    Wonka, Peter
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4008 - 4017
  • [4] Carion Nicolas, 2020, Computer Vision - ECCV 2020. 16th European Conference. Proceedings. Lecture Notes in Computer Science (LNCS 12346), P213, DOI 10.1007/978-3-030-58452-8_13
  • [5] Chen K, 2019, Arxiv, DOI arXiv:1906.07155
  • [6] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
    Chen, Liang-Chieh
    Zhu, Yukun
    Papandreou, George
    Schroff, Florian
    Adam, Hartwig
    [J]. COMPUTER VISION - ECCV 2018, PT VII, 2018, 11211 : 833 - 851
  • [7] DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
    Chen, Liang-Chieh
    Papandreou, George
    Kokkinos, Iasonas
    Murphy, Kevin
    Yuille, Alan L.
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018, 40 (04) : 834 - 848
  • [8] Chen M., 2021, BRIT MACHINE VISION
  • [9] Cong DC, 2019, INT CONF ACOUST SPEE, P1892, DOI 10.1109/ICASSP.2019.8683673
  • [10] Contributors M., 2020, MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark