Dual Semantic Relationship Attention Network for Image-Text Matching

Cited: 0
Authors
Wen, Keyu [1]
Gu, Xiaodong [1]
Affiliations
[1] Fudan Univ, Dept Elect Engn, Shanghai 200433, Peoples R China
Source
2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) | 2020
Funding
National Natural Science Foundation of China;
Keywords
cross-modal; retrieval; attention; semantic relationship;
DOI
10.1109/ijcnn48605.2020.9206782
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Image-text matching is a major task in cross-modal information processing. The main challenge is learning unified vision-and-language representations. Previous methods that perform well on this task focus primarily on the region features in images that correspond to the words in sentences. However, this causes the region features to lose contact with the global context, leading to mismatches with the non-object words in sentences. To alleviate this problem, a novel Dual Semantic Relationship Attention Network is proposed, consisting mainly of two modules: a separate semantic relationship module and a joint semantic relationship module. With these two modules, different hierarchies of semantic relationships are learned simultaneously, promoting the image-text matching process. Quantitative experiments on MS-COCO and Flickr-30K show that the method outperforms previous approaches by a large margin, owing to the effectiveness of the dual semantic relationship attention scheme.
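The abstract builds on the common region-to-word attention scheme for image-text matching. As a rough illustration of that baseline idea (not the paper's dual-module architecture, and with all names and the temperature value chosen here for illustration), the sketch below attends each word over all image regions and averages the resulting word-context cosines into an image-sentence score:

```python
import numpy as np

def cross_modal_attention(regions, words, temperature=4.0):
    """Score an image-sentence pair by word-over-region attention.

    regions: (k, d) array of image region features.
    words:   (n, d) array of word features in the same embedding space.
    Returns a scalar similarity: the mean cosine between each word and
    its attended region context. Names and temperature are illustrative.
    """
    # L2-normalize so dot products are cosine similarities.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)

    sim = w @ r.T                            # (n, k) word-region cosines
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over regions per word

    attended = attn @ r                      # (n, d) region context per word
    attended /= np.linalg.norm(attended, axis=1, keepdims=True)

    per_word = (w * attended).sum(axis=1)    # cosine(word, its context)
    return float(per_word.mean())
```

In this baseline, each word only sees a weighted mix of local region features; the paper's point is that such region-level matching loses the global context needed for non-object words, which its two relationship modules aim to restore.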
Pages: 7