SST: Spatial and Semantic Transformers for Multi-Label Image Recognition

Times Cited: 54
Authors
Chen, Zhao-Min [1 ]
Cui, Quan [2 ]
Zhao, Borui [3 ]
Song, Renjie [3 ]
Zhang, Xiaoqin [1 ]
Yoshie, Osamu [2 ]
Affiliations
[1] Wenzhou Univ, Coll Comp Sci & Artificial Intelligence, Wenzhou 325035, Peoples R China
[2] Waseda Univ, Grad Sch Informat Prod & Syst, Fukuoka 8080135, Japan
[3] Megvii Technol, Megvii Res Nanjing, Nanjing 210009, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Correlation; Semantics; Transformers; Image recognition; Task analysis; Training; Feature extraction; Multi-label image recognition; transformer; label correlation;
DOI
10.1109/TIP.2022.3148867
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Multi-label image recognition has attracted considerable research attention and achieved great success in recent years. Capturing label correlations is an effective way to improve the performance of multi-label image recognition. Two types of label correlations have been studied in particular, i.e., spatial and semantic correlations; however, previous methods in the literature considered only one of them. In this work, inspired by the great success of the Transformer, we propose a plug-and-play module, named the Spatial and Semantic Transformers (SST), to simultaneously capture spatial and semantic correlations in multi-label images. Our proposal mainly consists of two independent transformers that capture the spatial and semantic correlations, respectively. Specifically, the Spatial Transformer is designed to model correlations between features at different spatial positions, while the Semantic Transformer captures the co-existence of labels without manually defined rules. Beyond these methodological contributions, we also demonstrate that spatial and semantic correlations complement each other and deserve to be leveraged simultaneously in multi-label image recognition. Benefiting from the Transformer's ability to capture long-range correlations, our method outperforms state-of-the-art methods by a clear margin on four popular multi-label benchmark datasets. In addition, extensive ablation studies and visualizations are provided to validate the essential components of our method.
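For intuition, the sketch below illustrates the kind of architecture the abstract describes: two parallel transformer branches on top of a CNN feature map, one attending over spatial positions and one over per-class embeddings whose outputs are fused into multi-label scores. This is a minimal, hypothetical PyTorch sketch under assumptions made here for illustration, not the authors' released implementation; the module names, the class-wise attention pooling, and the fusion by averaging are all assumptions.

```python
# Illustrative sketch (NOT the authors' code): two parallel transformer
# branches over CNN features, one spatial and one semantic, as described
# at a high level in the abstract. Hyper-parameters are placeholders.
import torch
import torch.nn as nn

class SSTSketch(nn.Module):
    def __init__(self, in_dim=2048, num_classes=80, num_heads=8):
        super().__init__()
        # Spatial branch: self-attention over the H*W spatial tokens.
        self.spatial_encoder = nn.TransformerEncoderLayer(
            d_model=in_dim, nhead=num_heads, batch_first=True)
        # Semantic branch: pool one embedding per class, then let the
        # class embeddings attend to each other (label co-occurrence).
        self.class_proj = nn.Conv2d(in_dim, num_classes, kernel_size=1)
        self.semantic_encoder = nn.TransformerEncoderLayer(
            d_model=in_dim, nhead=num_heads, batch_first=True)
        self.spatial_fc = nn.Linear(in_dim, num_classes)
        self.semantic_fc = nn.Linear(in_dim, 1)

    def forward(self, feat):                     # feat: (B, C, H, W) CNN features
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)           # (B, H*W, C)

        # Spatial transformer: correlations between spatial positions.
        spatial = self.spatial_encoder(tokens)             # (B, H*W, C)
        spatial_logits = self.spatial_fc(spatial.mean(dim=1))      # (B, K)

        # Semantic transformer: class-specific embeddings via attention-style
        # pooling, then self-attention across the K classes.
        attn = self.class_proj(feat).flatten(2).softmax(dim=-1)    # (B, K, H*W)
        class_emb = torch.bmm(attn, tokens)                # (B, K, C)
        semantic = self.semantic_encoder(class_emb)        # (B, K, C)
        semantic_logits = self.semantic_fc(semantic).squeeze(-1)   # (B, K)

        # Fuse the two branches (simple averaging; purely illustrative).
        return (spatial_logits + semantic_logits) / 2
```

In a full pipeline, the backbone features would typically come from a CNN encoder and the fused logits would be trained with a multi-label binary cross-entropy loss; those details are omitted from this sketch.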
Pages: 2570-2583
Number of Pages: 14