Dual Transformer With Multi-Grained Assembly for Fine-Grained Visual Classification

Cited: 18
Authors
Ji, Ruyi [1 ,2 ]
Li, Jiaying [3 ]
Zhang, Libo [1 ]
Liu, Jing [4 ,5 ]
Wu, Yanjun [1 ]
Affiliations
[1] Chinese Acad Sci, State Key Lab Comp Sci, Inst Software, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101400, Peoples R China
[3] Beijing Informat Sci & Technol Univ, Sch Comp Sci, Beijing 100192, Peoples R China
[4] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 101400, Peoples R China
[5] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
Keywords
Transformer; multi-grained assembly; fine-grained visual classification
DOI
10.1109/TCSVT.2023.3248791
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Codes
0808; 0809
Abstract
Fine-grained visual classification (FGVC) requires distinguishing sub-categories within the same super-category, a task that suffers from small inter-class and large intra-class variance. This paper aims to improve FGVC performance with a novel dual Transformer framework (coined Dual-TR) with multi-grained assembly. Dual-TR encodes fine-grained objects through two parallel hierarchies, which makes it amenable to capturing subtle yet discriminative cues via the self-attention mechanism in ViT. Specifically, we perform orthogonal multi-grained assembly within the Transformer structure, i.e., intra-layer and inter-layer assembly, for a more robust representation. The former explores informative features across the self-attention heads within a Transformer layer; the latter assembles tokens across Transformer layers. Meanwhile, we introduce a center-loss constraint to increase intra-class compactness and enlarge inter-class separation. Extensive experiments show that Dual-TR performs on par with state-of-the-art methods on four public benchmarks: CUB-200-2011, NABirds, iNaturalist2017, and Stanford Dogs. Comprehensive ablation studies further demonstrate the effectiveness of the architectural design choices.
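The center-loss constraint mentioned in the abstract is, in its standard formulation, the mean squared distance between each sample's feature vector and the center of its class. The sketch below is a minimal illustration of that standard formula, not the authors' implementation; all names (`center_loss`, the toy data) are assumptions for this example.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Standard center loss: half the mean squared distance between
    each sample's feature vector and its class center."""
    diffs = features - centers[labels]               # (N, D) per-sample offsets
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

# Toy example: 4 samples, 2 classes, 3-D features.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))
labels = np.array([0, 0, 1, 1])
# Class centers as per-class feature means (in training they are learned/updated).
centers = np.stack([features[labels == c].mean(axis=0) for c in range(2)])

loss = center_loss(features, labels, centers)
print(loss)  # non-negative scalar; zero only when every sample sits on its center
```

Minimizing this term pulls same-class features toward a shared center (intra-class compactness), while the usual classification loss keeps different classes apart.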
Pages: 5009-5021
Page count: 13