Multi-granularity sequence generation for hierarchical image classification

Cited by: 1

Authors:

Liu, Xinda [1]

Wang, Lili [1,2]

Affiliations:

[1] Beihang Univ, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China

[2] Peng Cheng Lab, Shenzhen 518000, Peoples R China

Source:

COMPUTATIONAL VISUAL MEDIA | 2024 / Vol. 10 / No. 02

Funding:

National Natural Science Foundation of China; National Key Research and Development Program of China;

Keywords:

hierarchical multi-granularity classification; vision and text transformer; sequence generation; fine-grained image recognition; cross-modality attention;

DOI:

10.1007/s41095-022-0332-2

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with multiple granularity labels simultaneously. Existing methods tend to overlook that different image regions contribute differently to label prediction at different granularities, and also insufficiently consider relationships between the hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree and organize the multi-granularity labels into sequences, and vectorize them and add positional information. The proposed multi-granularity sequence generation method builds a decoder that takes visual representation sequences and semantic label embedding as inputs, and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgsg.
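The label-sequence step described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: the toy taxonomy, the coarse-to-fine ordering, the `<bos>` start token, and all function names are assumptions made for illustration. It shows how a leaf label could be expanded into a multi-granularity sequence, shifted for teacher forcing, and paired with the causal mask that a masked self-attention decoder would use.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the coarsest level).
PARENT = {
    "Laysan Albatross": "Diomedeidae",
    "Diomedeidae": "Procellariiformes",
    "Procellariiformes": None,
}


def label_sequence(leaf, parent):
    """Walk from the leaf up to the root, then return the path coarse-to-fine."""
    path = []
    node = leaf
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def build_vocab(parent, specials=("<bos>",)):
    """Assign an integer id to each special token and each label name."""
    names = list(specials) + sorted(parent)
    return {name: i for i, name in enumerate(names)}


def decoder_io(leaf, parent, vocab):
    """Teacher-forcing pair: decoder inputs are the targets shifted right behind <bos>."""
    ids = [vocab[name] for name in label_sequence(leaf, parent)]
    return [vocab["<bos>"]] + ids[:-1], ids


def causal_mask(n):
    """mask[i][j] is True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


vocab = build_vocab(PARENT)
inputs, targets = decoder_io("Laysan Albatross", PARENT, vocab)
mask = causal_mask(len(inputs))
```

With this ordering, the finest label is predicted last, so under the causal mask its prediction can depend on the coarser labels before it; positional embeddings and the cross-modality attention over visual tokens are omitted from the sketch.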


Pages: 243-260

Page count: 18
