Improving weakly supervised phrase grounding via visual representation contextualization with contrastive learning

Times Cited: 0
Authors
Wang, Xue [1 ,2 ]
Du, Youtian [1 ]
Verberne, Suzan [2 ]
Verbeek, Fons J. [2 ]
Affiliations
[1] Xi An Jiao Tong Univ, Fac Elect & Informat Engn, Xian 710049, Peoples R China
[2] Leiden Univ, Leiden Inst Adv Comp Sci, NL-2333 CA Leiden, Netherlands
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Visual representation; Phrase grounding; Contrastive learning; Weakly supervised learning; NMS;
DOI
10.1007/s10489-022-04259-9
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Weakly supervised phrase grounding aims to map the phrases in an image caption to the objects appearing in the image under the supervision of image-caption correspondence. We observe that current studies insufficiently model the complicated interactions among the visual components (i.e., the visual regions) and between the visual and textual components (i.e., the phrases). Therefore, this paper presents a novel weakly supervised learning approach to phrase grounding in which we systematically model the contextualized visual representation with three modules: (1) object proposals pooling (OPP), (2) visual self-attention (VSA) and (3) visual-textual cross-modal attention (VTCA). OPP alleviates the suppression of object proposals and improves the visual representation by trading off the richness of the visual components against computational efficiency. VSA captures the correlation among the object proposals and generates a representation of each proposal by incorporating the visual information of the others. To measure cross-modal compatibility in terms of topics, we introduce the VTCA module, which represents the visual topic corresponding to each textual component in a cross-modal common vector space. During training, we build a mixed contrastive loss function that considers both the cross-modal compatibility and the differences among the visual representations in the VSA module. Compared with state-of-the-art methods, the proposed approach improves performance by 3.88 and 1.24 percentage points on R@1, and by 2.23 and 0.26 percentage points on Pt_Acc, when trained on the MS COCO and Flickr30K Entities training sets, respectively. We have made our code available for follow-up research.
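To make the contrastive-learning idea described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of an InfoNCE-style image-caption contrastive loss over pooled region and phrase embeddings. It is written only from the abstract's high-level description: the function name, the mean-pooling simplification, and the temperature value are assumptions, and it does not reproduce the paper's OPP/VSA/VTCA modules or its mixed loss.

```python
# Illustrative sketch only: an InfoNCE-style image-caption contrastive loss
# over batches of region and phrase embeddings, as commonly used under
# image-caption supervision. All names and hyperparameters are assumptions;
# this is not the authors' released code.
import torch
import torch.nn.functional as F


def caption_image_contrastive_loss(region_feats, phrase_feats, temperature=0.07):
    """Contrast matched image-caption pairs against in-batch negatives.

    region_feats: (B, R, D) object-proposal embeddings per image.
    phrase_feats: (B, P, D) phrase embeddings per caption.
    """
    # Mean-pool each modality into one vector per sample (a simplification;
    # the paper instead builds phrase-conditioned visual topics via VTCA).
    img = F.normalize(region_feats.mean(dim=1), dim=-1)   # (B, D)
    txt = F.normalize(phrase_feats.mean(dim=1), dim=-1)   # (B, D)

    logits = img @ txt.t() / temperature                    # (B, B) similarities
    targets = torch.arange(img.size(0), device=img.device)  # matched pairs lie on the diagonal

    # Symmetric cross-entropy over image-to-caption and caption-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage with random features: 4 images, 36 proposals, 5 phrases, 256-d embeddings.
    regions = torch.randn(4, 36, 256)
    phrases = torch.randn(4, 5, 256)
    print(caption_image_contrastive_loss(regions, phrases).item())
```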
Pages: 14690-14702
Number of pages: 13