Dual Transformer With Multi-Grained Assembly for Fine-Grained Visual Classification

被引:12
|
作者
Ji, Ruyi [1 ,2 ]
Li, Jiaying [3 ]
Zhang, Libo [1 ]
Liu, Jing [4 ,5 ]
Wu, Yanjun [1 ]
机构
[1] Chinese Acad Sci, State Key Lab Comp Sci, Inst Software, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101400, Peoples R China
[3] Beijing Informat Sci & Technol Univ, Sch Comp Sci, Beijing 100192, Peoples R China
[4] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 101400, Peoples R China
[5] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
关键词
Transformer; multi-grained assembly; fine-grained visual classification;
D O I
10.1109/TCSVT.2023.3248791
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
Fine-grained visual classification requires distinguishing sub-categories within the same super-category, which suffers from small inter-class and large intra-class variances. This paper aims to improve the FGVC task towards better performance, for which we deliver a novel dual Transformer framework (coined Dual-TR) with multi-grained assembly. The Dual-TR is well-designed to encode fine-grained objects by two parallel hierarchies, which is amenable to capturing the subtle yet discriminative cues via the self-attention mechanism in ViT. Specifically, we perform orthogonal multi-grained assembly within the Transformer structure for a more robust representation, i.e., intra-layer and inter-layer assembly. The former aims to explore the informative feature in various self-attention heads within the Transformer layer. The latter pays attention to the token assembly across Transformer layers. Meanwhile, we introduce the constraint of center loss to pull intra-class samples' compactness and push that of inter-class samples. Extensive experiments show that Dual-TR performs on par with the state-of-the-art methods on four public benchmarks, including CUB-200-2011, NABirds, iNaturalist2017, and Stanford Dogs. The comprehensive ablation studies further demonstrate the effectiveness of architectural design choices.
引用
收藏
页码:5009 / 5021
页数:13
相关论文
共 50 条
  • [1] Dual-Dependency Attention Transformer for Fine-Grained Visual Classification
    Cui, Shiyan
    Hui, Bin
    SENSORS, 2024, 24 (07)
  • [2] Multi-Grained Selection and Fusion for Fine-Grained Image Representation
    Jiang, Jianrong
    Wang, Hongxing
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [3] Leveraging Fine-Grained Labels to Regularize Fine-Grained Visual Classification
    Wu, Junfeng
    Yao, Li
    Liu, Bin
    Ding, Zheyuan
    PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON COMPUTER MODELING AND SIMULATION (ICCMS 2019) AND 8TH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND APPLICATIONS (ICICA 2019), 2019, : 133 - 136
  • [4] Hierarchical attention vision transformer for fine-grained visual classification
    Hu, Xiaobin
    Zhu, Shining
    Peng, Taile
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2023, 91
  • [5] TransFGVC: transformer-based fine-grained visual classification
    Shen, Longfeng
    Hou, Bin
    Jian, Yulei
    Tu, Xisong
    Zhang, Yingjie
    Shuai, Lingying
    Ge, Fangzhen
    Chen, Debao
    VISUAL COMPUTER, 2025, 41 (04): : 2439 - 2459
  • [6] Fine-grained activity classification in assembly based on multi-visual modalities
    Chen, Haodong
    Zendehdel, Niloofar
    Leu, Ming C.
    Yin, Zhaozheng
    JOURNAL OF INTELLIGENT MANUFACTURING, 2024, 35 (05) : 2215 - 2233
  • [7] Convolutionally Enhanced Feature Fusion Visual Transformer for Fine-Grained Visual Classification
    Huang, Min
    Zhu, Saixing
    Wang, Zehua
    Qu, Shuanghong
    2024 16TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, ICMLC 2024, 2024, : 447 - 452
  • [8] Fine-Grained Visual Classification via Internal Ensemble Learning Transformer
    Xu, Qin
    Wang, Jiahui
    Jiang, Bo
    Luo, Bin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9015 - 9028
  • [9] CNN-Transformer with Stepped Distillation for Fine-Grained Visual Classification
    Xu, Qin
    Liu, Peng
    Wang, Jiahui
    Huang, Lili
    Tang, Jin
    PATTERN RECOGNITION AND COMPUTER VISION, PT IX, PRCV 2024, 2025, 15039 : 364 - 377
  • [10] Multi-part Token Transformer with Dual Contrastive Learning for Fine-grained Image Classification
    Wang, Chuanming
    Fu, Huiyuan
    Ma, Huadong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7648 - 7656