MIDWRSEG: ACQUIRING ADAPTIVE MULTI-SCALE CONTEXTUAL INFORMATION FOR ROAD-SCENE SEMANTIC SEGMENTATION

Cited by: 0

Authors:

Su, Bing [1]

Jin, Peng [1]

Lin, Yifeng [1]

Wang, Fuyang [1]

Affiliations:

[1] Changzhou Univ, Sch Comp Sci & Artif Intelligence, 2468 YanZeng West Rd, Changzhou, Jiangsu, Peoples R China

Source:

COMPUTING AND INFORMATICS | 2024 / Vol. 43 / No. 04

Keywords:

Deep convolutional network; attention mechanism; semantic segmentation; autonomous driving; NETWORK;

DOI:

10.31577/cai_2024_4_849

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present MIDWRSeg, a simple neural-network model for semantic segmentation. In complex road scenes, a large receptive field gathered at multiple scales is crucial for semantic segmentation. CNN architectures therefore need a way to establish long-range dependencies (large receptive fields) comparable to the attention mechanism of the Transformer architecture. However, the attention mechanism built from matrix operations on Query, Key and Value is too computationally expensive for real-time semantic segmentation models. We therefore construct a Multi-Scale Convolutional Attention (MSCA) block that uses inexpensive convolution operations to form long-range dependencies. The model adopts a Simple Inverted Residual (SIR) block for feature extraction in the initial encoding stage. After downsampling, the reduced-resolution feature maps pass through a sequence of stacked MSCA blocks, which form multi-scale long-range dependencies. Finally, to further enrich the range of adaptive receptive-field sizes, an Internal Depth Wise Residual (IDWR) block is introduced. In the decoding stage, a simple FCN-like decoder is used to reduce computational cost. Our method is competitive with existing real-time encoder-decoder semantic segmentation models on the Cityscapes and CamVid datasets. MIDWRSeg achieves 74.2% mIoU at 88.9 FPS on the Cityscapes test set and 76.8% mIoU at 95.2 FPS on the CamVid test set.
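The abstract's cost argument (Query/Key/Value attention is too expensive for real-time use, while convolutions are cheap) can be illustrated with a rough multiply-accumulate (MAC) count. The sketch below is not the authors' MSCA block. The function names, the counting convention, the strip-kernel sizes (7, 11, 21) and the 64 x 128 x 128 feature-map size are all assumptions chosen for illustration, since the abstract states none of them.

```python
def self_attention_macs(h: int, w: int, c: int) -> int:
    """Rough MAC count for single-head QKV self-attention on an h x w x c map."""
    n = h * w                      # number of tokens (one per pixel)
    projections = 3 * n * c * c    # linear projections producing Q, K and V
    scores = n * n * c             # Q @ K^T, quadratic in the number of pixels
    weighted_sum = n * n * c       # softmax(scores) @ V, also quadratic
    return projections + scores + weighted_sum


def strip_conv_attention_macs(h: int, w: int, c: int,
                              kernel_sizes=(7, 11, 21)) -> int:
    """Rough MAC count for a hypothetical multi-scale convolutional attention.

    Assumes one depthwise 1xk and one depthwise kx1 convolution per scale,
    followed by a single 1x1 convolution that mixes channels.
    """
    n = h * w
    depthwise = sum(2 * k * n * c for k in kernel_sizes)  # linear in n
    pointwise = n * c * c                                  # 1x1 channel mixing
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical downsampled feature map: 64 x 128 pixels, 128 channels.
    attn = self_attention_macs(64, 128, 128)
    conv = strip_conv_attention_macs(64, 128, 128)
    print(f"self-attention MACs:   {attn:,}")
    print(f"strip-conv MACs:       {conv:,}")
    print(f"ratio (attn / conv):   {attn / conv:.1f}")
```

Under these assumptions the attention cost grows with the square of the pixel count, whereas the convolutional cost grows linearly, which is the gap the abstract points to when it motivates replacing matrix-based attention with convolutions.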

Pages: 849-873

Page count: 25
