Token-Selective Vision Transformer for fine-grained image recognition of marine organisms

Cited by: 8

Authors:

Si, Guangzhe ^{[1]}

Xiao, Ying ^{[2]}

Wei, Bin ^{[3]}

Bullock, Leon Bevan ^{[4]}

Wang, Yueyue ^{[5]}

Wang, Xiaodong ^{[4]}

Affiliations:

[1] Ocean Univ China, Coll Elect Engn, Qingdao, Shandong, Peoples R China

[2] Hong Kong Univ Sci & Technol, Sch Sci, Hong Kong, Peoples R China

[3] Qingdao Univ, Affiliated Hosp, Shandong Key Lab Digital Med & Comp Assisted Surg, Qingdao, Shandong, Peoples R China

[4] Ocean Univ China, Coll Comp Sci & Technol, Qingdao, Shandong, Peoples R China

[5] Ocean Univ China, Comp Ctr, Qingdao, Shandong, Peoples R China

Source:

FRONTIERS IN MARINE SCIENCE | 2023 / Vol. 10

Funding:

National Natural Science Foundation of China;

Keywords:

token-selective; self-attention; vision transformer; fine-grained image classification; marine organisms;

DOI:

10.3389/fmars.2023.1174347

Chinese Library Classification (CLC) number:

X [Environmental Science, Safety Science];

Discipline classification code:

08 ; 0830 ;

Abstract:

Introduction: The objective of fine-grained image classification of marine organisms is to distinguish subtle variations between organisms so as to classify them accurately into subcategories. The key to accurate classification is locating the distinguishing feature regions, such as a fish's eyes, fins, or tail. Images of marine organisms are hard to work with: they are often taken from multiple angles and in different scenes, and they usually have complex backgrounds and contain humans or other distractions, all of which makes it difficult to focus on the marine organism itself and identify its most distinctive features.

Related work: Most existing fine-grained image classification methods based on Convolutional Neural Networks (CNNs) cannot locate the distinguishing feature regions accurately enough, and the regions they identify also contain a large amount of background data. The Vision Transformer (ViT) has strong global information-capturing abilities and performs strongly in traditional classification tasks. The core of ViT is a Multi-Head Self-Attention (MSA) mechanism, which first establishes connections between the different patch tokens of an image and then combines the information of all tokens for classification.

Methods: However, not all tokens are conducive to fine-grained classification; many of them contain extraneous data (noise). We aim to eliminate the influence of interfering tokens, such as background tokens, on the identification of marine organisms, and then gradually narrow down the local feature area to accurately determine the distinctive features. To this end, this paper puts forward a novel Transformer-based framework, the Token-Selective Vision Transformer (TSVT), in which Token-Selective Self-Attention (TSSA) selects the discriminative tokens for attention computation, which helps limit the attention to more precise local regions. TSSA is applied to different layers, and the number of selected tokens in each layer decreases relative to the previous layer, so the method gradually locates the distinguishing regions in a hierarchical manner.

Results: The effectiveness of TSVT is verified on three marine organism datasets, on which TSVT achieves state-of-the-art performance.
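The hierarchical token selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how tokens are scored, so the sketch assumes that patch tokens are ranked by their pre-softmax attention score with the class token, that queries, keys, and values are the raw token vectors (no learned projections), and that a single head is used. The names `select_tokens`, `selective_attention`, `run_layers`, and `keep_schedule` are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(tokens, candidates, keep):
    """Keep the class token (index 0) plus the `keep` candidate patch
    tokens that score highest against the class token (assumed criterion)."""
    cls = tokens[0]
    patches = [i for i in candidates if i != 0]
    ranked = sorted(patches, key=lambda i: dot(cls, tokens[i]), reverse=True)
    return [0] + sorted(ranked[:keep])


def selective_attention(tokens, kept):
    """Single-head self-attention in which every query attends only to the
    kept tokens; identity Q/K/V projections are a simplifying assumption."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        weights = softmax([dot(q, tokens[j]) / scale for j in kept])
        out.append([sum(w * tokens[j][d] for w, j in zip(weights, kept))
                    for d in range(dim)])
    return out


def run_layers(tokens, keep_schedule):
    """Apply selective attention layer by layer; each layer selects from
    the tokens kept by the previous one, so the kept set shrinks."""
    kept = list(range(len(tokens)))
    history = []
    for keep in keep_schedule:
        kept = select_tokens(tokens, kept, keep)
        tokens = selective_attention(tokens, kept)
        history.append(kept)
    return tokens, history


random.seed(0)
# One class token followed by nine patch tokens, each of dimension 8.
tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(10)]
out, history = run_layers(tokens, keep_schedule=[6, 4, 2])
for layer, kept in enumerate(history, 1):
    print(f"layer {layer}: kept token indices {kept}")
```

Because each layer chooses only among the tokens that survived the previous layer, the kept sets are nested, which mirrors the gradual narrowing toward distinguishing regions that the abstract describes.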

Pages: 11
