ENHANCING IMAGE-TEXT MATCHING WITH ADAPTIVE FEATURE AGGREGATION

Cited by: 4
Authors
Wang, Zuhui [1]
Yin, Yunting [1]
Ramakrishnan, I. V. [1]
Affiliations
[1] SUNY Stony Brook, Stony Brook, NY 11794 USA
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024
Keywords
triplet ranking loss; feature enhancement; cross-modal retrieval; image-text matching
DOI
10.1109/ICASSP48485.2024.10446913
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Image-text matching aims to find matched cross-modal pairs accurately. While current methods often rely on projecting cross-modal features into a common embedding space, they frequently suffer from imbalanced feature representations across different modalities, leading to unreliable retrieval results. To address these limitations, we introduce a novel Feature Enhancement Module that adaptively aggregates single-modal features for more balanced and robust image-text retrieval. Additionally, we propose a new loss function that overcomes the shortcomings of the original triplet ranking loss, thereby significantly improving retrieval performance. The proposed model is evaluated on two public datasets and achieves competitive retrieval performance compared with several state-of-the-art models. The implementation code is publicly available.
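For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of the conventional hinge-based triplet ranking loss used in image-text matching. It is not the loss proposed in the paper, which the abstract does not specify. The function name, the `margin` default of 0.2, and the `hardest` switch (keeping only the hardest negative per query, as in VSE++-style training) are assumptions made for illustration.

```python
def triplet_ranking_loss(sim, margin=0.2, hardest=False):
    """Hinge-based triplet ranking loss over a square similarity matrix.

    sim[i][j] is the similarity between image i and caption j; matched
    pairs lie on the diagonal. With hardest=True only the hardest
    negative per query contributes; otherwise all negatives are summed.
    (Illustrative baseline only, not the paper's proposed loss.)
    """
    n = len(sim)
    reduce = max if hardest else sum
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Image i as the query, non-matching captions as negatives.
        i2t = [max(0.0, margin + sim[i][j] - pos) for j in range(n) if j != i]
        # Caption i as the query, non-matching images as negatives.
        t2i = [max(0.0, margin + sim[j][i] - pos) for j in range(n) if j != i]
        if i2t:  # skip the degenerate single-pair batch
            total += reduce(i2t) + reduce(t2i)
    return total


# Toy batch of three image-caption pairs (values are made up).
sim = [
    [0.5, 0.6, 0.4],
    [0.2, 0.7, 0.3],
    [0.4, 0.1, 0.9],
]
loss_sum = triplet_ranking_loss(sim)                # all negatives
loss_hard = triplet_ranking_loss(sim, hardest=True) # hardest negative only
```

The summed variant penalizes every negative that violates the margin, while the hardest-negative variant keeps only the largest violation per query, so the second value is never larger than the first on the same batch.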
Pages: 8245-8249
Page count: 5