Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Cited by: 46
Authors
Zhang, Kun [1 ]
Mao, Zhendong [1 ]
Liu, An-An [3 ]
Zhang, Yongdong [1 ,2 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Anhui, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Anhui, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
Keywords
Semantics; Optimization; Visualization; Training; Task analysis; Representation learning; Correlation; Image-text matching; attention network; unified adaptive relevance distinguishable learning
DOI
10.1109/TMM.2022.3141603
Chinese Library Classification (CLC) Number
TP [Automation technology, computer technology]
Subject Classification Code
0812
Abstract
Image-text matching, as a fundamental cross-modal task, bridges the gap between vision and language. Its core is to accurately learn semantic alignment, i.e., to find the relevant shared semantics between image and text. Existing methods typically attend to all fragments whose word-region similarity exceeds an empirical threshold of zero as relevant shared semantics, e.g., via a ReLU operation that forces negative similarities to zero and retains positive ones. However, this fixed threshold is completely isolated from feature learning and cannot adaptively or accurately distinguish the varying distributions of relevant and irrelevant word-region similarities during training, which inevitably limits semantic alignment learning. To address this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism that incorporates the relevance threshold into a unified learning framework so as to maximally separate the relevant and irrelevant distributions and obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between these two distributions, driving the model to learn more discriminative features. The explicit relevance threshold is integrated directly into similarity matching, which brings two benefits: (1) it excludes the disturbance of irrelevant fragment contents and aggregates precisely the relevant shared semantics, boosting matching accuracy, and (2) it avoids computing attention for irrelevant fragment queries, reducing retrieval time. Experimental results on benchmarks show that UARDA substantially and consistently outperforms state-of-the-art methods, with relative rSum improvements of 2%-4% (16.9%-35.3% over the SCAN baseline), while reducing retrieval time by 50%-73%.
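To make the thresholding idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of text-to-image cross attention in which word-region similarities below a learnable relevance threshold are zeroed out, in contrast to the conventional fixed ReLU cut-off at zero. The class name, tensor shapes, normalization, and aggregation choices are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveRelevanceAttention(nn.Module):
    """Hypothetical sketch: cross-modal attention with a learnable
    relevance threshold instead of a fixed ReLU cut-off at zero."""

    def __init__(self, init_threshold: float = 0.0):
        super().__init__()
        # Learnable relevance boundary, optimized jointly with the features.
        self.threshold = nn.Parameter(torch.tensor(init_threshold))

    def forward(self, words: torch.Tensor, regions: torch.Tensor):
        # words:   (B, T, D) word features;  regions: (B, R, D) region features
        sim = torch.einsum(
            "btd,brd->btr",
            F.normalize(words, dim=-1),
            F.normalize(regions, dim=-1),
        )  # cosine word-region similarities
        # Keep only similarities above the learned threshold; pairs below it
        # are treated as irrelevant and contribute nothing to the alignment.
        relevant = torch.clamp(sim - self.threshold, min=0.0)
        attn = relevant / (relevant.sum(dim=-1, keepdim=True) + 1e-8)
        attended = torch.einsum("btr,brd->btd", attn, regions)
        # A word whose similarities all fall below the threshold attends to
        # nothing; at retrieval time such queries could simply be skipped.
        return attended, attn
```

Under these assumptions, setting init_threshold to zero recovers the conventional ReLU-style attention, so the learned boundary starts from that baseline and is then adjusted jointly with the feature representations during training.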
Pages: 1320-1332
Number of pages: 13
Related Papers
50 records in total
  • [41] Dual Stream Relation Learning Network for Image-Text Retrieval
    Wu, Dongqing
    Li, Huihui
    Gu, Cang
    Guo, Lei
    Liu, Hang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27 : 1551 - 1565
  • [42] Thangka Image-Text Matching Based on Adaptive Pooling Layer and Improved Transformer
    Wang, Kaijie
    Wang, Tiejun
    Guo, Xiaoran
    Xu, Kui
    Wu, Jiao
    APPLIED SCIENCES-BASEL, 2024, 14 (02):
  • [43] Asymmetric Polysemous Reasoning for Image-Text Matching
    Zhang, Hongping
    Yang, Ming
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 1013 - 1022
  • [44] Learning and Integrating Multi-Level Matching Features for Image-Text Retrieval
    Lan, Hong
    Zhang, Pufen
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 374 - 378
  • [45] An Image-Text Dual-Channel Union Network for Person Re-Identification
    Qi, Baoguang
    Chen, Yi
    Liu, Qiang
    He, Xiaohai
    Qing, Linbo
    Sheriff, Ray E.
    Chen, Honggang
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2023, 72 : 1 - 16
  • [46] Image-Text Matching With Shared Semantic Concepts
    Miao, Lanxin
    2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2022,
  • [47] Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching
    Huang, Feiran
    Zhang, Xiaoming
    Zhao, Zhonghua
    Li, Zhoujun
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (04) : 2008 - 2020
  • [48] Multi-level Symmetric Semantic Alignment Network for image-text matching
    Wang, Wenzhuang
    Di, Xiaoguang
    Liu, Maozhen
    Gao, Feng
    NEUROCOMPUTING, 2024, 599
  • [49] Giving Text More Imagination Space for Image-text Matching
    Dong, Xinfeng
    Han, Longfei
    Zhang, Dingwen
    Liu, Li
    Han, Junwei
    Zhang, Huaxiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 6359 - 6368
  • [50] Unifying knowledge iterative dissemination and relational reconstruction network for image-text matching
    Xie, Xiumin
    Li, Zhixin
    Tang, Zhenjun
    Yao, Dan
    Ma, Huifang
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (01)