Progressive Semantic Reconstruction Network for Weakly Supervised Referring Expression Grounding

Cited by: 0
Authors
Ji, Zhong [1 ,2 ]
Wu, Jiahe [1 ]
Wang, Yaodong [1 ]
Yang, Aiping [1 ]
Han, Jungong [3 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin Key Lab Brain Inspired Intelligence Techno, Tianjin 300072, Peoples R China
[2] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China
[3] Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, England
Funding
National Natural Science Foundation of China
Keywords
Image reconstruction; Semantics; Training; Grounding; Proposals; Detectors; Visualization; Referring expression grounding; weakly supervised; progressive semantic reconstruction; LANGUAGE;
DOI
10.1109/TCSVT.2024.3433547
CLC Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Code
0808; 0809
Abstract
Weakly supervised Referring Expression Grounding (REG) aims to localize the target entity in an image based on a given expression, where the mapping between image regions and expressions is unknown during training. It faces two primary challenges. First, conventional methods select regions to generate reconstructed texts and back-propagate the reconstruction loss between the reconstructed texts and the original expressions. However, semantic deviations in text reconstruction may introduce significant cross-modal bias, producing substantial losses even for correctly matched regions. Second, the absence of region-level ground truth in weakly supervised REG leaves training without stable and reliable supervision. To tackle these challenges, we propose a Progressive Semantic Reconstruction Network (PSRN), which employs a two-level matching-reconstruction process based on the key triad and adaptive phrases, respectively. We leverage progressive semantic reconstruction with a three-stage training strategy to mitigate deviations in the reconstructed texts. Additionally, we introduce a Constrained Interactions operation and an Attention Coordination mechanism to provide additional bidirectional supervision between the two matching processes. Experiments on three benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate that the proposed PSRN achieves competitive results. Our source code will be released at https://github.com/5jiahe/psrn.
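To make the reconstruction-based paradigm described in the abstract concrete, the following minimal PyTorch sketch illustrates the generic matching-reconstruction loop used in weakly supervised REG: region proposals are softly matched to the expression, the attended visual feature is decoded back into the text-embedding space, and the reconstruction error supervises region selection without box-level labels. This is a hedged illustration of the general paradigm, not the authors' PSRN; the class ReconstructionGrounder and all dimensions are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionGrounder(nn.Module):
    """Illustrative reconstruction-based grounder (hypothetical, not the authors' PSRN)."""
    def __init__(self, vis_dim=2048, txt_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, txt_dim)  # project proposal features into text space
        self.decoder = nn.Linear(txt_dim, txt_dim)   # "reconstruct" the expression embedding

    def forward(self, regions, expr):
        # regions: (B, N, vis_dim) proposal features; expr: (B, txt_dim) expression embedding
        v = self.vis_proj(regions)                            # (B, N, txt_dim)
        scores = torch.bmm(v, expr.unsqueeze(2)).squeeze(2)   # (B, N) region-expression matching
        attn = F.softmax(scores, dim=1)                       # soft region selection (no box labels)
        attended = torch.bmm(attn.unsqueeze(1), v).squeeze(1) # (B, txt_dim) attended visual context
        recon = self.decoder(attended)                        # reconstructed expression embedding
        loss = F.mse_loss(recon, expr)                        # reconstruction loss drives grounding
        return attn, loss

model = ReconstructionGrounder()
regions = torch.randn(2, 10, 2048)  # 2 images, 10 proposals each
expr = torch.randn(2, 512)          # pooled expression embeddings
attn, loss = model(regions, expr)
pred = attn.argmax(dim=1)           # at test time, the highest-attention proposal is the grounding

The deviation problem the paper targets is visible in this sketch: if the decoder reconstructs the expression poorly, the loss remains high even when the attention already peaks on the correct region, which is what PSRN's progressive semantic reconstruction is designed to mitigate.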
Pages: 13058-13070
Number of pages: 13