Semantic-Enhanced Attention Network for Image-Text Matching

Times Cited: 0
Authors
Zhou, Huanxiao [1 ,2 ,3 ]
Geng, Yushui [1 ,2 ,4 ]
Zhao, Jing [1 ,2 ,3 ]
Ma, Xishan [1 ,2 ,3 ]
Affiliations
[1] Qilu Univ Technol, Shandong Acad Sci, Key Lab Comp Power Network & Informat Secur, Minist Educ,Shandong Comp Sci Ctr, Jinan, Peoples R China
[2] Qilu Univ Technol, Shandong Acad Sci, Fac Comp Sci & Technol, Shandong Engn Res Ctr Big Data Appl Technol, Jinan, Peoples R China
[3] Shandong Fundamental Res Ctr Comp Sci, Shandong Prov Key Lab Comp Networks, Jinan, Peoples R China
[4] Qilu Univ Technol, Shandong Acad Sci, Grad Sch, Jinan, Peoples R China
Source
PROCEEDINGS OF THE 2024 27TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN, CSCWD 2024 | 2024
Keywords
image-text matching; cross-modal retrieval; self-attention; fragment embeddings
DOI
10.1109/CSCWD61410.2024.10580166
Chinese Library Classification
TP39 [Computer Applications]
Discipline Codes
081203; 0835
Abstract
Image-text matching is an important task in cross-modal information processing: it evaluates the similarity between images and texts. However, the two modalities have different distributions and representations and cannot be compared directly, so each must be processed separately. Most existing image-text matching methods extract image fragments and text fragments independently, use an attention mechanism to establish relationships between image regions and words in the text, and then aggregate the similarities of these image and text fragments. However, not all fragment alignments are meaningful; irrelevant alignments introduce redundancy and reduce retrieval accuracy. In addition, the contextual relationships among image regions and among text words should also be taken into account. In this paper, we address both issues and propose a Semantic-Enhanced Attention Network (SEAN). It first focuses on each modality separately, mining contextual relationships between fragments within each modality and aggregating the contextual information into the visual and textual embeddings. It then computes the similarity between all image regions and text fragments, concentrating attention on the region-word pairs with the highest similarity. Finally, these scores are aggregated to infer the overall image-text similarity. Our method achieves competitive results on two widely used image-text retrieval datasets, Flickr30K and MS-COCO.
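The fragment-level scoring described in the abstract — keeping only the highest-similarity region-word pairs and aggregating them into a global score — can be sketched as follows. This is a minimal illustrative sketch in the spirit of max-over-regions attention pooling, not the paper's exact formulation; the function names and the mean aggregation step are assumptions.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize embeddings to unit length so that dot products
    # between them are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def image_text_similarity(regions, words):
    """Score one image-text pair from its fragment embeddings.

    regions: (n_regions, d) array of visual fragment embeddings
    words:   (n_words, d)   array of textual fragment embeddings
    """
    # Cosine similarity between every region and every word.
    sim = l2norm(regions) @ l2norm(words).T   # shape (n_regions, n_words)
    # For each word, keep only its best-matching region,
    # suppressing irrelevant region-word alignments.
    best_per_word = sim.max(axis=0)           # shape (n_words,)
    # Aggregate the per-word scores into one global similarity
    # (mean pooling here is an illustrative choice).
    return float(best_per_word.mean())
```

A matching image-text pair, whose words align well with at least one region each, receives a higher score than a mismatched pair, which is what the retrieval ranking relies on.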
Pages: 1256-1261
Page count: 6