Rethinking Local and Global Feature Representation for Dense Prediction

Cited by: 14
Authors
Chen, Mohan [1]
Zhang, Li [1]
Feng, Rui [1]
Xue, Xiangyang [1]
Feng, Jianfeng [1]
Affiliations
[1] Fudan University, Shanghai, People's Republic of China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Shanghai;
Keywords
Dense prediction; Vision transformer; Semantic segmentation; Depth estimation; Object detection;
DOI
10.1016/j.patcog.2022.109168
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Although fully convolutional networks (FCNs) have dominated dense prediction tasks (e.g., semantic segmentation, depth estimation and object detection) for decades, they are inherently limited in capturing long-range structured relationships with layers of local kernels. While recent Transformer-based models have proven extremely successful in computer vision tasks by capturing global representations, they tend to deteriorate dense prediction results by over-smoothing regions that contain fine details (e.g., boundaries and small objects). To this end, we aim to provide an alternative perspective by rethinking local and global feature representation for dense prediction. Specifically, we deploy a Dual-Stream Convolution-Transformer architecture, called DSCT, that takes advantage of both convolution and Transformer layers to learn a rich feature representation, combined with a task decoder to provide a powerful dense prediction model. DSCT extracts a high-resolution local feature representation from convolution layers and a global feature representation from Transformer layers. With the local and global context modeled explicitly in every layer, the two streams can be combined with a decoder to perform semantic segmentation, monocular depth estimation or object detection. Extensive experiments show that DSCT achieves superior performance on all three tasks. For semantic segmentation, DSCT sets a new state of the art on the Cityscapes validation set (83.31% mIoU) with only 80,000 training iterations and achieves appealing performance (49.27% mIoU) on the ADE20K validation set, outperforming most alternatives. For monocular depth estimation, our model achieves 2.423 RMSE on the KITTI Eigen split, superior to most convolution or Transformer counterparts. For object detection, without using FPN, we achieve 44.5% APb on the COCO dataset with Faster R-CNN, which is higher than Conformer. (c) 2022 Elsevier Ltd. All rights reserved.
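To make the dual-stream design concrete, the following is a minimal sketch of what one convolution-plus-Transformer block with a simple fusion step could look like. It is an illustration under stated assumptions, not the paper's implementation: the class name DualStreamBlock, the channel and head sizes, and the concatenate-and-project fusion are all hypothetical choices.

import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    # Hypothetical block in the spirit of DSCT's description: a convolution
    # stream for high-resolution local features, a Transformer stream for
    # global context, fused by a 1x1 projection. Not the paper's code.
    def __init__(self, channels: int = 256, num_heads: int = 8):
        super().__init__()
        # Local stream: a 3x3 convolution preserves fine spatial detail.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global stream: self-attention over the flattened feature map
        # models the long-range dependencies that local kernels miss.
        self.global_stream = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        # Fusion: concatenate both streams and project back to `channels`.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local_feat = self.local(x)                         # (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
        global_feat = self.global_stream(tokens)           # (B, H*W, C)
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local_feat, global_feat], dim=1))

# A task decoder (segmentation, depth, or detection head) would consume
# the fused features; a random tensor stands in for backbone output here.
block = DualStreamBlock(channels=256)
out = block(torch.randn(1, 256, 32, 32))   # -> (1, 256, 32, 32)

In the paper's actual design the two streams run through every layer of the network and feed a task-specific decoder; this sketch only shows how local and global features can coexist at a single stage.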
Pages: 11
Cited References
66 records
[41] Ren, Shaoqing; He, Kaiming; Girshick, Ross; Sun, Jian. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[42] Ronneberger, Olaf; Fischer, Philipp; Brox, Thomas. U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), Part III, 2015, 9351: 234-241.
[43] Shrivastava, Abhinav; Gupta, Abhinav; Girshick, Ross. Training Region-based Object Detectors with Online Hard Example Mining. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 761-769.
[44] Srinivas, Aravind; Lin, Tsung-Yi; Parmar, Niki; Shlens, Jonathon; Abbeel, Pieter; Vaswani, Ashish. Bottleneck Transformers for Visual Recognition. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021: 16514-16524.
[45] Sun, Ke; Xiao, Bin; Liu, Dong; Wang, Jingdong. Deep High-Resolution Representation Learning for Human Pose Estimation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019: 5686-5696.
[46] Tao, A. D. 2020. arXiv:2005.10821.
[47] Vaswani, Ashish; et al. Attention Is All You Need. Advances in Neural Information Processing Systems, 2017, 30.
[48] Wang, Wenhai; Xie, Enze; Li, Xiang; Fan, Deng-Ping; Song, Kaitao; Liang, Ding; Lu, Tong; Luo, Ping; Shao, Ling. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021: 548-558.
[49] Woo, Sanghyun; Park, Jongchan; Lee, Joon-Young; Kweon, In So. CBAM: Convolutional Block Attention Module. Computer Vision - ECCV 2018, Part VII, 2018, 11211: 3-19.
[50] Wu, Haiping; Xiao, Bin; Codella, Noel; Liu, Mengchen; Dai, Xiyang; Yuan, Lu; Zhang, Lei. CvT: Introducing Convolutions to Vision Transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021: 22-31.