Dual Transformer With Multi-Grained Assembly for Fine-Grained Visual Classification

Cited: 18
Authors
Ji, Ruyi [1 ,2 ]
Li, Jiaying [3 ]
Zhang, Libo [1 ]
Liu, Jing [4 ,5 ]
Wu, Yanjun [1 ]
Affiliations
[1] Chinese Acad Sci, State Key Lab Comp Sci, Inst Software, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101400, Peoples R China
[3] Beijing Informat Sci & Technol Univ, Sch Comp Sci, Beijing 100192, Peoples R China
[4] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 101400, Peoples R China
[5] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
Keywords
Transformer; multi-grained assembly; fine-grained visual classification
DOI
10.1109/TCSVT.2023.3248791
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Codes
0808; 0809
Abstract
Fine-grained visual classification (FGVC) requires distinguishing sub-categories within the same super-category, a task that suffers from small inter-class and large intra-class variance. This paper aims to improve FGVC performance with a novel dual Transformer framework (coined Dual-TR) with multi-grained assembly. Dual-TR encodes fine-grained objects through two parallel hierarchies, which makes it amenable to capturing subtle yet discriminative cues via the self-attention mechanism in ViT. Specifically, we perform orthogonal multi-grained assembly within the Transformer structure, i.e., intra-layer and inter-layer assembly, for a more robust representation. The former explores informative features across the self-attention heads within a Transformer layer; the latter assembles tokens across Transformer layers. Meanwhile, we introduce a center-loss constraint to increase intra-class compactness and enlarge inter-class separation. Extensive experiments show that Dual-TR performs on par with state-of-the-art methods on four public benchmarks: CUB-200-2011, NABirds, iNaturalist2017, and Stanford Dogs. Comprehensive ablation studies further demonstrate the effectiveness of the architectural design choices.
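The center-loss constraint mentioned in the abstract is, in its standard formulation, the mean squared distance between each sample's feature vector and the center of its class. The sketch below is a minimal illustration of that standard formula, not the authors' implementation; all names (`center_loss`, the toy data) are assumptions for this example.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Standard center loss: half the mean squared distance between
    each sample's feature vector and its class center."""
    diffs = features - centers[labels]               # (N, D) per-sample offsets
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

# Toy example: 4 samples, 2 classes, 3-D features.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))
labels = np.array([0, 0, 1, 1])
# Class centers as per-class feature means (in training they are learned/updated).
centers = np.stack([features[labels == c].mean(axis=0) for c in range(2)])

loss = center_loss(features, labels, centers)
print(loss)  # non-negative scalar; zero only when every sample sits on its center
```

Minimizing this term pulls same-class features toward a shared center (intra-class compactness), while the usual classification loss keeps different classes apart.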
Pages: 5009-5021
Page count: 13