More Than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching

Cited by: 7
Authors
Chen, Yuxiao [1 ]
Yuan, Jianbo [2 ]
Zhao, Long [1 ]
Chen, Tianlang [2 ]
Luo, Rui [2 ]
Davis, Larry [2 ]
Metaxas, Dimitris N. [1 ]
Affiliations
[1] Rutgers State Univ, New Brunswick, NJ 08901 USA
[2] Amazon Com Serv Inc, Seattle, WA USA
Source
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV) | 2023
DOI
10.1109/WACV56688.2023.00441
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal attention mechanisms have been widely applied to the image-text matching task, achieving remarkable improvements thanks to their ability to learn fine-grained relevance across modalities. However, the cross-modal attention models in existing methods can be sub-optimal and inaccurate because no direct supervision is provided during training. In this work, we propose two novel training strategies, namely the Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address these limitations. These constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies that can be readily integrated into existing cross-modal attention models. Additionally, we introduce three metrics, Attention Precision, Recall, and F1-Score, to quantitatively measure the quality of the learned attention. We evaluate the proposed constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Experimental results on the Flickr30k and MS-COCO datasets demonstrate that integrating these constraints generally improves both retrieval performance and the attention metrics.
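The record does not include the paper's exact definitions of Attention Precision, Recall, and F1-Score. As a rough, hypothetical sketch only: given a matrix of attention weights between text tokens and image regions, and a binary annotation of which regions are actually relevant to each token, one plausible way to score attention quality is to threshold the attention mass and compute standard precision/recall/F1 against the annotation. The threshold, the binary-relevance assumption, and the function itself are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def attention_prf(attn, relevant, threshold=0.5):
    """Illustrative (hypothetical) attention Precision/Recall/F1.

    attn:      (n_tokens, n_regions) attention weights for one image-text pair
    relevant:  (n_tokens, n_regions) binary mask of truly relevant regions
    threshold: attention mass above which a region counts as "attended"
    """
    attended = attn >= threshold              # binarize the attention map
    tp = np.logical_and(attended, relevant).sum()   # correctly attended regions
    precision = tp / max(attended.sum(), 1)   # fraction of attended that are relevant
    recall = tp / max(relevant.sum(), 1)      # fraction of relevant that are attended
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```

Under this sketch, an attention model that concentrates its mass exactly on the annotated regions scores 1.0 on all three metrics, while diffuse attention lowers precision and missed regions lower recall.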
Pages: 4421-4429
Page count: 9