Multi-view inter-modality representation with progressive fusion for image-text matching

Cited by: 10
Authors
Wu, Jie [1]
Wang, Leiquan [1]
Chen, Chenglizhao [1]
Lu, Jing [1]
Wu, Chunlei [1]
Affiliations
[1] China Univ Petr East China, Coll Comp Sci & Technol, Qingdao, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image-text matching; Cross-modal matching; Multi-view; Attention
DOI
10.1016/j.neucom.2023.02.043
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recently, image-text matching has been intensively explored to bridge vision and language. Previous methods explore the inter-modality relationship of an image-text pair from a single-view feature. However, a single inter-modality relationship is unlikely to capture all of the information the pair contains. In this paper, a novel Multi-View Inter-Modality Representation with Progressive Fusion (MIRPF) is developed to explore inter-modality relationships from multi-view features. The multi-view strategy provides more complementary and global semantic clues than single-view approaches. In particular, a multi-view inter-modality representation network is constructed to generate multiple inter-modality representations, which provide diverse views for discovering latent image-text relationships. Furthermore, a progressive fusion module fuses the inter-modality features stepwise, making full use of the inherent complementarity between different views. Extensive experiments on Flickr30K and MSCOCO verify the superiority of MIRPF over several existing approaches. The code is available at: https://github.com/jasscia18/MIRPF. © 2023 Published by Elsevier B.V.
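
The abstract names two ideas: several inter-modality representations computed from different views, and a stepwise (progressive) fusion of those representations. The NumPy sketch below is only an illustration of that general pattern and is not the MIRPF implementation. Everything in it is an assumption made for illustration: the function names, the use of one projection matrix per view, the word-to-region attention, the sigmoid gate, and the random features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inter_modality_view(regions, words, proj):
    """One hypothetical 'view' of an image-text pair.

    regions: (num_regions, d) image region features
    words:   (num_words, d) word features
    proj:    (d, d) view-specific projection (an assumption of this sketch)
    """
    r = regions @ proj
    w = words @ proj
    # Each word attends over all regions (scaled dot-product attention).
    attn = softmax(w @ r.T / np.sqrt(r.shape[1]), axis=1)
    attended = attn @ r
    # Element-wise word/region interaction, averaged over words.
    return (w * attended).mean(axis=0)


def progressive_fusion(views, gates):
    """Fuse view vectors one at a time with a sigmoid gate (assumed design).

    views: list of (d,) vectors; gates: list of (2d, d) matrices, one per step.
    """
    fused = views[0]
    for view, gate_weights in zip(views[1:], gates):
        logits = np.concatenate([fused, view]) @ gate_weights
        gate = 1.0 / (1.0 + np.exp(-logits))
        fused = gate * fused + (1.0 - gate) * view
    return fused


def match_score(regions, words, projs, gates, w_out):
    """Scalar matching score from the fused multi-view representation."""
    views = [inter_modality_view(regions, words, p) for p in projs]
    return float(progressive_fusion(views, gates) @ w_out)


# Usage with random stand-in features (no real data or trained weights).
rng = np.random.default_rng(0)
d = 8
regions = rng.normal(size=(5, d))
words = rng.normal(size=(7, d))
projs = [rng.normal(size=(d, d)) for _ in range(3)]
gates = [rng.normal(size=(2 * d, d)) for _ in range(len(projs) - 1)]
print(match_score(regions, words, projs, gates, rng.normal(size=d)))
```

Because each fusion step is a convex combination, every coordinate of the fused vector stays between the corresponding coordinates of the two inputs to that step. A trained model would learn the projections and gates, typically with a ranking loss, which this sketch omits.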
Pages: 1-12
Page count: 12