A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation

Cited by: 17
Authors
Zhu, Jinjing [1]
Luo, Yunhao [3]
Zheng, Xu [1]
Wang, Hao [4]
Wang, Lin [1, 2]
Affiliations
[1] HKUST GZ, AI Thrust, Guangzhou, Peoples R China
[2] HKUST, Dept CSE, Guangzhou, Peoples R China
[3] Brown Univ, Providence, RI 02912 USA
[4] Alibaba Grp, Alibaba Cloud, Hangzhou, Peoples R China
Source
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023) | 2023
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/ICCV51070.2023.01076
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we strive to answer the question 'how to collaboratively learn convolutional neural network (CNN)-based and vision transformer (ViT)-based models by selecting and exchanging the reliable knowledge between them for semantic segmentation?' Accordingly, we propose an online knowledge distillation (KD) framework that can simultaneously learn compact yet effective CNN-based and ViT-based models, with two key technical breakthroughs that take full advantage of CNNs and ViTs while compensating for their limitations. Firstly, we propose heterogeneous feature distillation (HFD) to improve the students' consistency in the low-layer feature space by mimicking heterogeneous features between CNNs and ViTs. Secondly, to help the two students learn reliable knowledge from each other, we propose bidirectional selective distillation (BSD), which can dynamically transfer selective knowledge. This is achieved by 1) region-wise BSD, which determines the directions of knowledge transfer between the corresponding regions in the feature space, and 2) pixel-wise BSD, which discerns which prediction knowledge should be transferred in the logit space. Extensive experiments on three benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art online distillation methods by a large margin and show its efficacy in collaborative learning between ViT-based and CNN-based models.
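As a rough illustration of the pixel-wise BSD idea described in the abstract, the sketch below selects a transfer direction per pixel in logit space. The abstract does not state how reliability is measured, so the criterion used here (lower cross-entropy against the ground-truth label marks the more reliable student) is an assumption, as are the function names `softmax`, `kl_divergence`, and `pixelwise_selective_kd`. It is a minimal pure-Python sketch, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pixelwise_selective_kd(cnn_logits, vit_logits, labels):
    """Return (loss_cnn, loss_vit), the mean distillation loss per student.

    Assumption (not from the abstract): at each pixel, the student with the
    lower cross-entropy against the ground-truth label is treated as the more
    reliable one, and only the other student is pulled toward it.
    """
    loss_cnn = 0.0
    loss_vit = 0.0
    for z_cnn, z_vit, y in zip(cnn_logits, vit_logits, labels):
        p_cnn, p_vit = softmax(z_cnn), softmax(z_vit)
        ce_cnn = -math.log(p_cnn[y] + 1e-12)
        ce_vit = -math.log(p_vit[y] + 1e-12)
        if ce_vit < ce_cnn:
            # ViT is more reliable here: the CNN student mimics the ViT output.
            loss_cnn += kl_divergence(p_vit, p_cnn)
        elif ce_cnn < ce_vit:
            # CNN is more reliable here: the ViT student mimics the CNN output.
            loss_vit += kl_divergence(p_cnn, p_vit)
        # Equal reliability: no knowledge is transferred at this pixel.
    n = len(labels)
    return loss_cnn / n, loss_vit / n


# Two pixels, two classes: the CNN is better on pixel 0, the ViT on pixel 1.
cnn = [[2.0, 0.0], [0.0, 1.0]]
vit = [[0.0, 2.0], [0.0, 3.0]]
loss_cnn, loss_vit = pixelwise_selective_kd(cnn, vit, labels=[0, 1])
```

In an actual training loop the more reliable student's distribution would typically be treated as a fixed target (no gradient flows through it), so each pixel updates only the student that is being taught.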
Pages: 11686-11696
Page count: 11