Improving vision transformer for medical image classification via token-wise perturbation

Cited by: 2
Authors
Li, Yuexiang [1 ]
Huang, Yawen [2 ]
He, Nanjun [3 ]
Ma, Kai [2 ]
Zheng, Yefeng [1 ,2 ]
Affiliations
[1] Guangxi Med Univ, Med AI Res MARS Grp, Guangxi Collaborat Innovat Ctr Genom & Personalized Med, Ctr Genom & Personalized Med, Guangxi Key Lab Genom, Nanning 530021, Peoples R China
[2] Tencent Jarvis Res Ctr, YouTu Lab, Shenzhen 518000, Peoples R China
[3] OPPO, Shenzhen 518000, Peoples R China
Funding
National Key R&D Program of China;
Keywords
Self-supervised learning; Vision transformer; Image classification;
DOI
10.1016/j.jvcir.2023.104022
CLC classification
TP [Automation & Computer Technology];
Discipline code
0812;
Abstract
Transformer has achieved impressive successes on various computer vision tasks. However, most existing studies require pretraining the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) to achieve satisfactory performance, and such datasets are usually unavailable for medical images. Additionally, due to the gap between medical and natural images, the improvement brought by ImageNet-pretrained weights degrades significantly when the weights are transferred to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning (SSL) approach designed specifically for medical image classification with a Transformer backbone. BOLT consists of two networks, namely online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network's representation of the same patch embedding tokens under a different perturbation. To maximally exploit the Transformer on limited medical data, we propose an auxiliary difficulty ranking task: the Transformer is enforced to identify which branch (i.e., online or target) is processing the more difficult perturbed tokens. Overall, the Transformer endeavours to distill transformation-invariant features from the perturbed tokens, simultaneously achieving difficulty measurement and maintaining the consistency of self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks, i.e., skin lesion classification, knee fatigue fracture grading and diabetic retinopathy grading. The experimental results validate the superiority of BOLT for medical image classification, compared to ImageNet-pretrained weights and state-of-the-art SSL approaches.
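The two training signals described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear `encode` function is a stand-in for the Transformer backbone, the noise-based `perturb_tokens` is one plausible form of token-wise perturbation, and all names and the perturbation strengths are illustrative assumptions. The sketch only shows how a BYOL-style consistency loss between an online and a target branch combines with a difficulty-ranking label derived from which branch received the stronger perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_tokens(tokens, strength):
    # Token-wise perturbation: independent noise added to each patch
    # embedding token (one hypothetical instantiation of the idea).
    return tokens + strength * rng.standard_normal(tokens.shape)

def encode(tokens, W):
    # Stand-in "encoder": mean-pooled nonlinear projection. A real BOLT
    # implementation would run a ViT backbone over the token sequence.
    return np.tanh(tokens @ W).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy input: 16 patch embedding tokens of dimension 32.
tokens = rng.standard_normal((16, 32))
W_online = rng.standard_normal((32, 8)) * 0.1
W_target = W_online.copy()  # In training the target weights would track
                            # the online weights via an EMA update, e.g.
                            # W_target = tau * W_target + (1 - tau) * W_online

# The two branches see the *same* tokens under *different* perturbations.
s_online, s_target = 0.5, 0.1
z_online = encode(perturb_tokens(tokens, s_online), W_online)
z_target = encode(perturb_tokens(tokens, s_target), W_target)

# Signal 1: BYOL-style consistency loss between branch representations.
consistency_loss = 1.0 - cosine(z_online, z_target)

# Signal 2: auxiliary difficulty ranking -- which branch processed the
# more strongly perturbed (i.e., more difficult) tokens?
difficulty_label = int(s_online > s_target)  # 1 -> online branch was harder

print(round(consistency_loss, 4), difficulty_label)
```

In the paper, both signals are optimized jointly so that the backbone learns representations that are stable under perturbation while still encoding enough information to rank perturbation difficulty.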
Pages: 9
Related papers
50 records
  • [41] ViT-AE++: Improving Vision Transformer Autoencoder for Self-supervised Medical Image Representations
    Prabhakar, Chinmay
    Li, Hongwei Bran
    Yang, Jiancheng
    Shit, Suprosanna
    Wiestler, Benedikt
    Menze, Bjoern H.
    MEDICAL IMAGING WITH DEEP LEARNING, VOL 227, 2023, 227 : 666 - 679
  • [42] Medical image classification: Knowledge transfer via residual U-Net and vision transformer-based teacher-student model with knowledge distillation
    Song, Yucheng
    Wang, Jincan
    Ge, Yifan
    Li, Lifeng
    Guo, Jia
    Dong, Quanxing
    Liao, Zhifang
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2024, 102
  • [43] Remote Sensing Scene Classification via Second-Order Differentiable Token Transformer Network
    Ni, Kang
    Wu, Qianqian
    Li, Sichan
    Zheng, Zhizhong
    Wang, Peng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 15
  • [44] Multi-part Token Transformer with Dual Contrastive Learning for Fine-grained Image Classification
    Wang, Chuanming
    Fu, Huiyuan
    Ma, Huadong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7648 - 7656
  • [45] MIL-ViT: A multiple instance vision transformer for fundus image classification
    Bi, Qi
    Sun, Xu
    Yu, Shuang
    Ma, Kai
    Bian, Cheng
    Ning, Munan
    He, Nanjun
    Huang, Yawen
    Li, Yuexiang
    Liu, Hanruo
    Zheng, Yefeng
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2023, 97
  • [46] Lightweight vision image transformer (LViT) model for skin cancer disease classification
    Dwivedi, Tanay
    Chaurasia, Brijesh Kumar
    Shukla, Man Mohan
    INTERNATIONAL JOURNAL OF SYSTEM ASSURANCE ENGINEERING AND MANAGEMENT, 2024, 15 (10) : 5030 - 5055
  • [47] ViTFSL-Baseline: A Simple Baseline of Vision Transformer Network for Few-Shot Image Classification
    Wang, Guangpeng
    Wang, Yongxiong
    Pan, Zhiqun
    Wang, Xiaoming
    Zhang, Jiapeng
    Pan, Jiayun
    IEEE ACCESS, 2024, 12 : 11836 - 11849
  • [48] A Hyperspectral Image Classification Method Based on Adaptive Spectral Spatial Kernel Combined with Improved Vision Transformer
    Wang, Aili
    Xing, Shuang
    Zhao, Yan
    Wu, Haibin
    Iwahori, Yuji
    REMOTE SENSING, 2022, 14 (15)
  • [49] Network Intrusion Detection Based on Feature Image and Deformable Vision Transformer Classification
    He, Kan
    Zhang, Wei
    Zong, Xuejun
    Lian, Lian
    IEEE ACCESS, 2024, 12 : 44335 - 44350
  • [50] TransMCGC: a recast vision transformer for small-scale image classification tasks
    Jian-Wen Xiang
    Min-Rong Chen
    Pei-Shan Li
    Hao-Li Zou
    Shi-Da Li
    Jun-Jie Huang
    Neural Computing and Applications, 2023, 35 : 7697 - 7718