Video semantic segmentation via feature propagation with holistic attention

Times cited: 23

Authors:

Wu, Junrong [1]

Wen, Zongzheng [1]

Zhao, Sanyuan [1]

Huang, Kele [2]

Affiliations:

[1] Beijing Inst Technol, Sch Comp Sci, Beijing Lab Intelligent Informat Technol, Beijing 100081, Peoples R China

[2] Univ Chinese Acad Sci, Beijing, Peoples R China

Source:

PATTERN RECOGNITION | 2020 / Vol. 104

Funding:

National Natural Science Foundation of China;

Keywords:

Real-time; Attention mechanism; Feature propagation; Video semantic segmentation;

DOI:

10.1016/j.patcog.2020.107268

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Since the frames of a video are inherently contiguous, information redundancy is ubiquitous. Unlike previous works that densely process every frame of a video, this paper presents a novel method that focuses on efficient feature propagation across frames to tackle the challenging video semantic segmentation task. First, we propose a Light, Efficient and Real-time network (denoted LERNet) as a strong backbone for per-frame processing. Then we mine rich features within a key frame and propagate cross-frame consistency information by computing a temporal holistic attention between the key frame and the following non-key frame. Each element of the attention matrix represents the global correlation between a pixel of the non-key frame and a pixel of the previous key frame. Concretely, we propose a new attention module that captures spatial consistency on low-level features along the temporal dimension. We then use the attention weights as spatial transition guidance to directly generate the high-level features of the current non-key frame from the weighted features of the corresponding key frame. Finally, we efficiently fuse the hierarchical features of the non-key frame to obtain the final segmentation result. Extensive experiments on two popular datasets, Cityscapes and CamVid, demonstrate that the proposed approach achieves a remarkable balance between inference speed and accuracy. (C) 2020 Elsevier Ltd. All rights reserved.
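The central step described in the abstract (attention computed on low-level features between a non-key frame and the previous key frame, then used to carry key-frame high-level features over to the non-key frame) can be sketched roughly as below. This is a minimal NumPy illustration of generic dot-product (non-local style) attention, not the authors' LERNet code. The function names, the 1/sqrt(C) scaling, the absence of learned projections, and the assumption that low- and high-level maps share one spatial resolution are all hypothetical choices made for the sketch.

```python
import numpy as np

def temporal_holistic_attention(key_low, cur_low):
    """Hypothetical sketch: attention between every pixel of the current
    non-key frame and every pixel of the previous key frame.

    key_low, cur_low: low-level feature maps of shape (C, H, W).
    Returns an (H*W, H*W) matrix; row i holds softmax-normalised
    similarities between non-key pixel i and all key-frame pixels.
    """
    c, h, w = key_low.shape
    k = key_low.reshape(c, h * w).T        # (N, C) key-frame pixels
    q = cur_low.reshape(c, h * w).T        # (N, C) non-key-frame pixels
    scores = q @ k.T / np.sqrt(c)          # (N, N) correlations; scaling is an assumption
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def propagate_high_level(attn, key_high):
    """Hypothetical sketch: build non-key high-level features (D, H, W)
    as attention-weighted sums of key-frame high-level features."""
    d, h, w = key_high.shape
    v = key_high.reshape(d, h * w).T       # (N, D) key-frame high-level pixels
    out = attn @ v                         # each non-key pixel = weighted sum of key pixels
    return out.T.reshape(d, h, w)

# Usage with random stand-in features (assumed shapes, same resolution for both levels)
rng = np.random.default_rng(0)
key_low = rng.standard_normal((16, 8, 8))
cur_low = rng.standard_normal((16, 8, 8))
key_high = rng.standard_normal((32, 8, 8))
attn = temporal_holistic_attention(key_low, cur_low)
cur_high = propagate_high_level(attn, key_high)
print(attn.shape, cur_high.shape)  # (64, 64) (32, 8, 8)
```

Because every non-key pixel attends to every key-frame pixel, the attention matrix grows quadratically with the number of pixels, which is why the sketch works on small feature maps.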

Pages: 11

References: 42 in total

