TransFGVC: transformer-based fine-grained visual classification

Cited: 0
Authors
Shen, Longfeng [1 ,2 ,4 ]
Hou, Bin [1 ,4 ]
Jian, Yulei [1 ,2 ,4 ]
Tu, Xisong [1 ,4 ]
Zhang, Yingjie [1 ,4 ]
Shuai, Lingying [3 ]
Ge, Fangzhen [1 ,2 ,4 ]
Chen, Debao [1 ,2 ,4 ]
Affiliations
[1] Huaibei Normal Univ, Sch Comp Sci & Technol, Anhui Engn Res Ctr Intelligent Comp & Applicat Cog, 100 Dongshen Rd, Huaibei 235000, Anhui, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, 5089 Wangjiang West Rd, Hefei 230088, Anhui, Peoples R China
[3] Huaibei Normal Univ, Coll Life Sci, 100 Dongshan Rd, Huaibei 235000, Anhui, Peoples R China
[4] Huaibei Normal Univ, Anhui Big Data Res Ctr Univ Manage, 100 Dongshen Rd, Huaibei 235000, Anhui, Peoples R China
Source
VISUAL COMPUTER | 2025, Vol. 41, No. 04
Funding
National Natural Science Foundation of China
Keywords
Computer vision; Fine-grained visual classification; LSTM; Swin Transformer; Birds-267-2022 dataset
DOI
10.1007/s00371-024-03545-6
CLC number
TP31 [Computer Software]
Discipline codes
081202; 0835
Abstract
Fine-grained visual classification (FGVC) aims to identify subcategories of objects within the same superclass. The task is challenging owing to high intra-class variance and low inter-class variance. Most recent methods focus on locating discriminative regions and then training a classification network to capture the subtle differences among them. However, the detection network often captures an entire part of the object, introducing positioning errors, and these methods ignore the correlations between the extracted regions. We propose a novel, highly scalable approach, TransFGVC, that combines Swin Transformers with long short-term memory (LSTM) networks to address these problems. The Swin Transformer obtains remarkable visual tokens through stacked self-attention layers, and the LSTM models them globally, which not only accurately locates the discriminative regions but also introduces global information that is important for FGVC. The proposed method achieves competitive performance, with accuracy rates of 92.7%, 91.4%, and 91.5% on the public CUB-200-2011 and NABirds datasets and our Birds-267-2022 dataset, while its Params and FLOPs are 25% and 27% lower, respectively, than those of the current SOTA method HERBS. To promote the development of FGVC, we built the Birds-267-2022 dataset, which contains 267 categories and 12,233 images.
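The pipeline the abstract describes (a sequence of visual tokens from a Swin Transformer, aggregated globally by an LSTM, then classified) can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the random `tokens` array stands in for real Swin Transformer features, and all dimensions, parameter names, and initialisations here are assumed for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gate pre-activations stacked as [input, forget, cell, output]."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2 * H])     # forget gate
    g = np.tanh(z[2 * H:3 * H]) # candidate cell state
    o = sigmoid(z[3 * H:])      # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
num_tokens, token_dim, hidden, num_classes = 49, 96, 64, 267

# Stand-in for the Swin Transformer output: one visual token per spatial patch.
tokens = rng.standard_normal((num_tokens, token_dim))

# LSTM parameters (randomly initialised for the sketch).
W = rng.standard_normal((4 * hidden, token_dim)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)

# Run the LSTM over the token sequence to build a global representation.
h = np.zeros(hidden)
c = np.zeros(hidden)
for t in range(num_tokens):
    h, c = lstm_step(tokens[t], h, c, W, U, b)

# Classify from the final hidden state (a Birds-267-sized head, for illustration).
W_cls = rng.standard_normal((num_classes, hidden)) * 0.1
logits = W_cls @ h
pred = int(np.argmax(logits))
```

The point of the LSTM pass is that each token's contribution is conditioned on the tokens already seen, so the final hidden state carries the global, cross-region context that purely local region-detection pipelines miss.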
Pages: 2439-2459
Page count: 21
Related papers
50 records in total
  • [31] Fine-grained citation count prediction via a transformer-based model with among-attention mechanism
    Huang, Shengzhi
    Huang, Yong
    Bu, Yi
    Lu, Wei
    Qian, Jiajia
    Wang, Dan
    INFORMATION PROCESSING & MANAGEMENT, 2022, 59 (02)
  • [32] Fine-grained bird image classification based on counterfactual method of vision transformer model
    Chen, Tianhua
    Li, Yanyue
    Qiao, Qinghua
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (05): 6221-6239
  • [33] Multi-Scale Feature Transformer Based Fine-Grained Image Classification Method
    Zhang T.
    Cai C.
    Luo X.
    Zhu Y.
    Beijing Youdian Daxue Xuebao/Journal of Beijing University of Posts and Telecommunications, 2023, 46 (04): 70-75
  • [35] Fine-Grained Visual Classification Based on Sparse Bilinear Convolutional Neural Network
    Ma L.
    Wang Y.
    Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2019, 32 (04): 336-344
  • [36] TECMH: Transformer-Based Cross-Modal Hashing For Fine-Grained Image-Text Retrieval
    Li, Qiqi
    Ma, Longfei
    Jiang, Zheng
    Li, Mingyong
    Jin, Bo
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (02): 3713-3728
  • [37] Fine-Grained Visual Classification Network Based on Fusion Pooling and Attention Enhancement
    Xiao B.
    Guo J.
    Zhang X.
    Wang M.
    Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2023, 36 (07): 661-670
  • [38] Fine-grained activity classification in assembly based on multi-visual modalities
    Chen, Haodong
    Zendehdel, Niloofar
    Leu, Ming C.
    Yin, Zhaozheng
    JOURNAL OF INTELLIGENT MANUFACTURING, 2024, 35 (05): 2215-2233
  • [39] Transformer-based statement level vulnerability detection by cross-modal fine-grained features capture
    Tao, Wenxin
    Su, Xiaohong
    Ke, Yekun
    Han, Yi
    Zheng, Yu
    Wei, Hongwei
    KNOWLEDGE-BASED SYSTEMS, 2025, 316
  • [40] Optimized lightweight CA-transformer: Using transformer for fine-grained visual categorization
    Wang, Haiqing
    Shang, Shuqi
    Wang, Dongwei
    He, Xiaoning
    Feng, Kai
    Zhu, Hao
    Li, Chengpeng
    Wang, Yuetao
    ECOLOGICAL INFORMATICS, 2022, 71