Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning

Cited by: 0
Authors
Ge, Chongjian [1 ]
Liang, Youwei [2 ]
Song, Yibing [2 ]
Jiao, Jianbo [3 ]
Wang, Jue [2 ]
Luo, Ping [1 ]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Tencent AI Lab, Bellevue, WA 98004 USA
[3] Univ Oxford, Oxford, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Studies on self-supervised visual representation learning (SSL) improve encoder backbones to discriminate training samples without labels. While CNN encoders trained via SSL achieve recognition performance comparable to those trained via supervised learning, their network attention is under-explored for further improvement. Motivated by transformers, which exploit visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL. The proposed CARE framework consists of a CNN stream (C-stream) and a transformer stream (T-stream), where each stream contains two branches. C-stream follows an existing SSL framework with two CNN encoders, two projectors, and a predictor. T-stream contains two transformers, two projectors, and a predictor. T-stream connects to the CNN encoders and runs in parallel to the remaining C-stream. During training, we perform SSL in both streams simultaneously and use the T-stream output to supervise C-stream. The features from the CNN encoders are modulated in T-stream for visual attention enhancement and become suitable for the SSL scenario. We use these modulated features to supervise C-stream for learning attentive CNN encoders. In this way, we revitalize CNN attention by using transformers as guidance. Experiments on several standard visual recognition benchmarks, including image classification, object detection, and semantic segmentation, show that the proposed CARE framework improves CNN encoder backbones to state-of-the-art performance.
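The abstract describes a concrete dual-stream architecture, which a small sketch can make precise. Below is a minimal PyTorch sketch of the online branch of each stream under stated assumptions: the names (MLPHead, CARESketch, attention_supervision_loss), the head sizes, the two-layer transformer, and the BYOL-style cosine loss are illustrative choices of ours, not the authors' released code; the momentum (target) branches of both streams are omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPHead(nn.Module):
    """Projector/predictor head (two-layer MLP), as is common in BYOL-style SSL."""
    def __init__(self, in_dim, hidden_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class CARESketch(nn.Module):
    """Online branches of C-stream and T-stream sharing one CNN backbone (sketch)."""
    def __init__(self, cnn_encoder, feat_dim=2048):
        super().__init__()
        self.encoder = cnn_encoder  # CNN backbone returning a (B, C, H, W) feature map
        # T-stream: a small transformer that modulates the CNN feature map via attention.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.c_projector = MLPHead(feat_dim)           # C-stream projector
        self.t_projector = MLPHead(feat_dim)           # T-stream projector
        self.c_predictor = MLPHead(128, out_dim=128)   # C-stream predictor
        self.t_predictor = MLPHead(128, out_dim=128)   # T-stream predictor

    def forward(self, x):
        fmap = self.encoder(x)                         # (B, C, H, W)
        c_feat = fmap.mean(dim=(2, 3))                 # global average pooling for C-stream
        tokens = fmap.flatten(2).transpose(1, 2)       # (B, H*W, C) tokens for T-stream
        t_feat = self.transformer(tokens).mean(dim=1)  # attention-modulated features
        z_c = self.c_predictor(self.c_projector(c_feat))
        z_t = self.t_predictor(self.t_projector(t_feat))
        return z_c, z_t

def attention_supervision_loss(z_c, z_t):
    """T-stream output supervises C-stream: align normalized embeddings (BYOL-style)."""
    return 2 - 2 * F.cosine_similarity(z_c, z_t.detach(), dim=-1).mean()

# Illustrative usage with a ResNet-50 trunk (avgpool and fc layers removed):
#   import torchvision
#   trunk = nn.Sequential(*list(torchvision.models.resnet50().children())[:-2])
#   model = CARESketch(trunk)
#   z_c, z_t = model(torch.randn(4, 3, 224, 224))
#   loss = attention_supervision_loss(z_c, z_t)

In the full framework as described, each stream would additionally maintain its own SSL objective over two augmented views with momentum-updated target branches, and a term like the one above would be added to the C-stream loss so that the attention-modulated T-stream features guide the CNN encoder.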
Pages: 14
Related Papers
50 items in total
  • [31] Can Semantic Labels Assist Self-Supervised Visual Representation Learning?
    Wei, Longhui
    Xie, Lingxi
    He, Jianzhong
    Zhang, Xiaopeng
    Tian, Qi
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 2642 - 2650
  • [32] Multi-Augmentation for Efficient Self-Supervised Visual Representation Learning
    Tran, Van Nhiem
    Huang, Chi-En
    Liu, Shen-Hsuan
    Yang, Kai-Lin
    Ko, Timothy
    Li, Yung-Hui
2022 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO WORKSHOPS (IEEE ICMEW 2022), 2022
  • [33] Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning
    Chen, Richard J.
    Chen, Chengkuan
    Li, Yicong
    Chen, Tiffany Y.
    Trister, Andrew D.
    Krishnan, Rahul G.
    Mahmood, Faisal
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16123 - 16134
  • [34] Towards Pointsets Representation Learning via Self-Supervised Learning and Set Augmentation
    Arsomngern, Pattaramanee
    Long, Cheng
    Suwajanakorn, Supasorn
    Nutanong, Sarana
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (01) : 1201 - 1216
  • [35] Self-supervised Video Hashing via Bidirectional Transformers
    Li, Shuyan
    Li, Xiu
    Lu, Jiwen
    Zhou, Jie
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 13544 - 13553
  • [36] Stereo Depth Estimation via Self-supervised Contrastive Representation Learning
    Tukra, Samyakh
    Giannarou, Stamatia
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT VII, 2022, 13437 : 604 - 614
  • [37] Self-Supervised Facial Motion Representation Learning via Contrastive Subclips
    Sun, Zheng
    Torrie, Shad A.
    Sumsion, Andrew W.
    Lee, Dah-Jye
    ELECTRONICS, 2023, 12 (06)
  • [38] Self-Supervised Video Representation Learning via Latent Time Navigation
    Yang, Di
    Wang, Yaohui
    Kong, Quan
    Dantcheva, Antitza
    Garattoni, Lorenzo
    Francesca, Gianpiero
    Bremond, Francois
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3118 - 3126
  • [39] DocMAE: Document Image Rectification via Self-supervised Representation Learning
    Liu, Shaokai
    Feng, Hao
    Zhou, Wengang
    Li, Houqiang
    Liu, Cong
    Wu, Feng
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1613 - 1618
  • [40] MetricBERT: Text Representation Learning via Self-Supervised Triplet Training
    Malkiel, Itzik
    Ginzburg, Dvir
    Barkan, Oren
    Caciularu, Avi
    Weill, Yoni
    Koenigstein, Noam
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8142 - 8146