TransFGVC: transformer-based fine-grained visual classification

Cited by: 0
Authors
Shen, Longfeng [1 ,2 ,4 ]
Hou, Bin [1 ,4 ]
Jian, Yulei [1 ,2 ,4 ]
Tu, Xisong [1 ,4 ]
Zhang, Yingjie [1 ,4 ]
Shuai, Lingying [3 ]
Ge, Fangzhen [1 ,2 ,4 ]
Chen, Debao [1 ,2 ,4 ]
Affiliations
[1] Huaibei Normal Univ, Sch Comp Sci & Technol, Anhui Engn Res Ctr Intelligent Comp & Applicat Cog, 100 Dongshen Rd, Huaibei 235000, Anhui, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, 5089 Wangjiang West Rd, Hefei 230088, Anhui, Peoples R China
[3] Huaibei Normal Univ, Coll Life Sci, 100 Dongshan Rd, Huaibei 235000, Anhui, Peoples R China
[4] Huaibei Normal Univ, Anhui Big Data Res Ctr Univ Manage, 100 Dongshen Rd, Huaibei 235000, Anhui, Peoples R China
Source
VISUAL COMPUTER | 2025, Vol. 41, Issue 04
Funding
National Natural Science Foundation of China;
Keywords
Computer vision; Fine-grained visual classification; LSTM; Swin Transformer; Birds-267-2022 dataset;
DOI
10.1007/s00371-024-03545-6
Chinese Library Classification
TP31 [Computer Software];
Discipline Codes
081202; 0835;
Abstract
Fine-grained visual classification (FGVC) aims to identify subcategories of objects within the same superclass. This task is challenging owing to high intra-class variance and low inter-class variance. The most recent methods focus on locating discriminative areas and then training the classification network to further capture the subtle differences among them. On the one hand, the detection network often extracts an entire part of the object, and localization errors occur. On the other hand, these methods ignore the correlations between the extracted regions. We propose a novel, highly scalable approach, called TransFGVC, that combines Swin Transformers with long short-term memory (LSTM) networks to address the above problems. The Swin Transformer is used to obtain salient visual tokens through self-attention layer stacking, and the LSTM is used to model them globally, which not only accurately locates the discriminative region but also introduces the global information that is important for FGVC. The proposed method achieves competitive performance, with accuracy rates of 92.7%, 91.4% and 91.5% on the public CUB-200-2011 and NABirds datasets and our Birds-267-2022 dataset, while its parameter count and FLOPs are 25% and 27% lower, respectively, than those of the current SotA method HERBS. To promote the development of FGVC, we constructed the Birds-267-2022 dataset, which has 267 categories and 12,233 images.
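The abstract's core idea is a pipeline in which a backbone produces a sequence of visual tokens and an LSTM then aggregates them into a global representation for classification. Below is a minimal pure-Python sketch of that token-sequence → LSTM → classifier-head pattern. It is an illustrative assumption, not the authors' implementation: `TinyLSTMCell`, `classify_tokens`, the random token values, and all dimensions are hypothetical stand-ins (a real system would use a Swin Transformer backbone and a deep-learning framework).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTMCell:
    """Minimal LSTM cell in pure Python, for illustration only."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        # One weight matrix and bias per gate: input, forget, cell, output.
        self.W = {g: mat(hidden_size, input_size + hidden_size) for g in "ifco"}
        self.b = {g: [0.0] * hidden_size for g in "ifco"}

    def step(self, x, h, c):
        z = x + h  # concatenated [x; h]
        def gate(g, act):
            return [act(sum(w * v for w, v in zip(row, z)) + bb)
                    for row, bb in zip(self.W[g], self.b[g])]
        i = gate("i", sigmoid)
        f = gate("f", sigmoid)
        g = gate("c", math.tanh)
        o = gate("o", sigmoid)
        c_new = [ff * cc + ii * gg for ff, cc, ii, gg in zip(f, c, i, g)]
        h_new = [oo * math.tanh(cc) for oo, cc in zip(o, c_new)]
        return h_new, c_new

def classify_tokens(tokens, num_classes=3, hidden=8):
    """Run an LSTM over a sequence of 'visual tokens' (as if emitted by a
    backbone) and score classes from the final hidden state."""
    cell = TinyLSTMCell(len(tokens[0]), hidden)
    h = [0.0] * hidden
    c = [0.0] * hidden
    for t in tokens:                     # sequential global modeling of tokens
        h, c = cell.step(t, h, c)
    rng = random.Random(1)               # random linear head, illustration only
    W_out = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)]
             for _ in range(num_classes)]
    return [sum(w * v for w, v in zip(row, h)) for row in W_out]

# Example: 4 mock tokens of dimension 6.
tokens = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
scores = classify_tokens(tokens)
print(len(scores))  # one score per class
```

The design point the sketch mirrors is that the recurrent pass sees every token in order, so the final hidden state mixes information across all extracted regions rather than scoring each region independently.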
Pages: 2439-2459
Number of pages: 21
Related articles
50 records
  • [1] TransFGVC: transformer-based fine-grained visual classification
    Shen, Longfeng
    Hou, Bin
    Jian, Yulei
    Tu, Xisong
    Zhang, Yingjie
    Shuai, Lingying
    Ge, Fangzhen
    Chen, Debao
    The Visual Computer, 2025, 41 (4) : 2439 - 2459
  • [2] Transformer-based descriptors with fine-grained region supervisions for visual place recognition
    Wang, Yuwei
    Qiu, Yuanying
    Cheng, Peitao
    Zhang, Junyu
    KNOWLEDGE-BASED SYSTEMS, 2023, 280
  • [3] Transformer-Based Few-Shot and Fine-Grained Image Classification Method
    Lu, Yan
    Wang, Yangping
    Wang, Wenrun
    Computer Engineering and Applications, 2023, 59 (23) : 219 - 227
  • [4] Hierarchical attention vision transformer for fine-grained visual classification
    Hu, Xiaobin
    Zhu, Shining
    Peng, Taile
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2023, 91
  • [5] Dual Transformer With Multi-Grained Assembly for Fine-Grained Visual Classification
    Ji, Ruyi
    Li, Jiaying
    Zhang, Libo
    Liu, Jing
    Wu, Yanjun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 5009 - 5021
  • [6] Convolutionally Enhanced Feature Fusion Visual Transformer for Fine-Grained Visual Classification
    Huang, Min
    Zhu, Saixing
    Wang, Zehua
    Qu, Shuanghong
    2024 16TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, ICMLC 2024, 2024, : 447 - 452
  • [7] Fine-Grained Visual Classification via Internal Ensemble Learning Transformer
    Xu, Qin
    Wang, Jiahui
    Jiang, Bo
    Luo, Bin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9015 - 9028
  • [8] Dual-Dependency Attention Transformer for Fine-Grained Visual Classification
    Cui, Shiyan
    Hui, Bin
    SENSORS, 2024, 24 (07)
  • [9] CNN-Transformer with Stepped Distillation for Fine-Grained Visual Classification
    Xu, Qin
    Liu, Peng
    Wang, Jiahui
    Huang, Lili
    Tang, Jin
    PATTERN RECOGNITION AND COMPUTER VISION, PT IX, PRCV 2024, 2025, 15039 : 364 - 377
  • [10] Leveraging Fine-Grained Labels to Regularize Fine-Grained Visual Classification
    Wu, Junfeng
    Yao, Li
    Liu, Bin
    Ding, Zheyuan
    PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON COMPUTER MODELING AND SIMULATION (ICCMS 2019) AND 8TH INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND APPLICATIONS (ICICA 2019), 2019, : 133 - 136