DRTN: Dual Relation Transformer Network with feature erasure and contrastive learning for multi-label image classification

Cited by: 0

Authors:

Zhou, Wei [1]

Lin, Kang [1]

Zheng, Zhijie [1]

Chen, Dihu [1]

Su, Tao [1]

Hu, Haifeng [1]

Affiliations:

[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Guangdong, Peoples R China

Source:

NEURAL NETWORKS | 2025 / Vol. 187

Funding:

National Natural Science Foundation of China;

Keywords:

Multi-label image classification; Transformer; Pseudo-region; Feature erasure; Contrastive learning; ATTENTION;

DOI:

10.1016/j.neunet.2025.107309

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The objective of the multi-label image classification (MLIC) task is to simultaneously identify multiple objects present in an image. Several researchers directly flatten 2D feature maps into 1D grid feature sequences and use a Transformer encoder to capture the correlations among grid features and thereby learn object relationships. Although they obtain promising results, these Transformer-based methods lose spatial information. In addition, current attention-based models often focus only on salient feature regions and ignore other potentially useful features that contribute to the MLIC task. To tackle these problems, we present a novel Dual Relation Transformer Network (DRTN) for the MLIC task, which can be trained in an end-to-end manner. Concretely, to compensate for the loss of spatial information in grid features caused by the flattening operation, we adopt a grid aggregation scheme to generate pseudo-region features, which requires no additional expensive annotations for training an object detector. Then, a new dual relation enhancement (DRE) module is proposed to capture correlations between objects using two different visual features, thereby combining the complementary advantages of grid and pseudo-region features. After that, we design a new feature enhancement and erasure (FEE) module to learn discriminative features and mine additional potentially valuable features. By using an attention mechanism to discover the most salient feature regions and removing them with a region-level erasure strategy, our FEE module is able to mine other potentially useful features from the remaining parts. Further, we devise a novel contrastive learning (CL) module to encourage the foregrounds of salient and potential features to be closer, while pushing these foregrounds further away from background features. This approach compels our model to learn discriminative and valuable features more comprehensively. Extensive experiments demonstrate that the DRTN method surpasses current MLIC models on three challenging benchmarks, i.e., the MS-COCO 2014, PASCAL VOC 2007, and NUS-WIDE datasets.
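The abstract gives no formulas for the FEE and CL modules, so the following is only an illustrative sketch of the two ideas it names, not the authors' implementation. It assumes a single-channel feature map stored as nested lists, a square erasure window no larger than the map, and an InfoNCE-style loss over cosine similarities. The function names (`erase_salient_region`, `contrastive_loss`) and the parameters `size` and `temperature` are hypothetical.

```python
import math


def erase_salient_region(attention, features, size=2):
    """Zero a size x size window around the attention peak (assumed scheme)."""
    h, w = len(attention), len(attention[0])
    # Locate the most salient grid cell.
    peak_r, peak_c = max(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: attention[rc[0]][rc[1]],
    )
    # Clamp the window so it stays inside the map (assumes size <= h and size <= w).
    top = min(max(peak_r - size // 2, 0), h - size)
    left = min(max(peak_c - size // 2, 0), w - size)
    erased = [row[:] for row in features]  # copy so the input stays untouched
    for r in range(top, top + size):
        for c in range(left, left + size):
            erased[r][c] = 0.0
    return erased


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(salient_fg, potential_fg, backgrounds, temperature=0.1):
    """InfoNCE-style loss (an assumption): pull the two foregrounds together,
    push each of them away from every background vector."""
    pos = math.exp(cosine(salient_fg, potential_fg) / temperature)
    neg = sum(
        math.exp(cosine(fg, bg) / temperature)
        for fg in (salient_fg, potential_fg)
        for bg in backgrounds
    )
    return -math.log(pos / (pos + neg))


attention = [[0.9, 0.1, 0.0], [0.2, 0.1, 0.0], [0.0, 0.0, 0.3]]
features = [[1.0] * 3 for _ in range(3)]
remaining = erase_salient_region(attention, features)
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy run the window around the top-left attention peak is zeroed, and the loss is lower when the two foreground vectors agree and differ from the background than when one foreground matches the background instead.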

Pages: 14

References: 80 in total

