Multiscale feature fusion and enhancement in a transformer for the fine-grained visual classification of tree species

Cited by: 1

Authors:

Dong, Yanqi [1]

Ma, Zhibin [1]

Zi, Jiali [1]

Xu, Fu [1,2]

Chen, Feixiang [1,2]

Affiliations:

[1] Beijing Forestry Univ, Sch Informat Sci & Technol, Beijing 100083, Peoples R China

[2] Natl Forestry & Grassland Adm, Engn Res Ctr Forestry oriented Intelligent Informa, Beijing 100083, Peoples R China

Source:

ECOLOGICAL INFORMATICS | 2025 / Volume 86

Funding:

National Key Research and Development Program;

Keywords:

Fine-grained image classification; Tree classification; Swin transformer; Feature fusion; Feature enhancement; VISION TRANSFORMER; IDENTIFICATION; IMAGES; MODEL;

DOI:

10.1016/j.ecoinf.2025.103029

Chinese Library Classification number:

Q14 [Ecology (bioecology)];

Discipline classification code:

071012 ; 0713 ;

Abstract:

Accurate and rapid fine-grained visual classification (FGVC) of tree species within the same family can provide technical support for tree surveys, research, and conservation. However, FGVC faces challenges such as large intraclass differences and small interclass differences. Recognizing tree species within the same family requires focusing on and correlating overall and multiorgan features of the trees while mitigating the influence of complex natural backgrounds, occlusion effects and other factors. To address these challenges, we propose multiscale feature fusion (MFF) and enhancement in transformers to improve recognition performance. The method consists of a Swin transformer backbone, an MFF module, a discriminative feature enhancement (DFE) module, and a texture feature enhancement (TFE) module. The MFF module aims to strike a balance between global and local feature extraction. The DFE module is employed to mitigate the impact of background noise, whereas the TFE module is used to enhance the feature extraction associated with complex textures and spatial patterns. We conducted experiments on a constructed dataset of tree species from the same family, achieving a top-1 accuracy of 90.3 % and a top-3 accuracy of 96.8 %. In addition, the method performed well on three popular FGVC datasets, namely, the Flavia, Oxford Flowers, and PlantCLEF 2015 datasets, with top-1 accuracies of 100 %, 99.2 %, and 81.4 %, respectively. The ablation experiments and module visualizations also yielded satisfactory results. Thus, this work provides a solution to enhance the FGVC task.
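The abstract reports results as top-1 and top-3 accuracy. As a reminder of what those metrics measure, the following is a minimal sketch of top-k accuracy in plain Python. The function name `top_k_accuracy` and the toy scores and labels are illustrative assumptions and are not taken from the paper; the authors' evaluation code is not shown in this record.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 1
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices by descending score; ties keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data (hypothetical): 4 samples, 3 classes.
toy_scores = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.40, 0.35, 0.25],
    [0.20, 0.50, 0.30],
]
toy_labels = [0, 1, 2, 1]

print(top_k_accuracy(toy_scores, toy_labels, k=1))
print(top_k_accuracy(toy_scores, toy_labels, k=3))
```

Top-3 accuracy is never lower than top-1 accuracy on the same predictions, because a sample counted as correct at k = 1 is also correct at k = 3.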

Pages: 13
