Progressive Semantic Reconstruction Network for Weakly Supervised Referring Expression Grounding

Cited by: 0
Authors
Ji, Zhong [1 ,2 ]
Wu, Jiahe [1 ]
Wang, Yaodong [1 ]
Yang, Aiping [1 ]
Han, Jungong [3 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin Key Lab Brain Inspired Intelligence Techno, Tianjin 300072, Peoples R China
[2] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China
[3] Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, England
Funding
National Natural Science Foundation of China;
Keywords
Image reconstruction; Semantics; Training; Grounding; Proposals; Detectors; Visualization; Referring expression grounding; weakly supervised; progressive semantic reconstruction; LANGUAGE;
DOI
10.1109/TCSVT.2024.3433547
CLC Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Code
0808; 0809;
Abstract
Weakly supervised Referring Expression Grounding (REG) aims to localize the target entity in an image based on a given expression, where the mapping between image regions and expressions is unknown during training. It faces two primary challenges. First, conventional methods select regions to generate reconstructed texts and then compute the backpropagated loss between regions and expressions. However, semantic deviations in text reconstruction may introduce significant cross-modal bias, producing substantial losses even for correctly matched regions. Second, the absence of region-level ground truth in weakly supervised REG leaves training without stable and reliable supervision. To tackle these challenges, we propose a Progressive Semantic Reconstruction Network (PSRN), which employs a two-level matching-reconstruction process based on the key triad and adaptive phrases, respectively. We leverage progressive semantic reconstruction with a three-stage training strategy to mitigate deviations in the reconstructed texts. Additionally, we introduce a Constrained Interactions operation and an Attention Coordination mechanism to provide additional bidirectional supervision between the two matching processes. Experiments on three benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate that the proposed PSRN achieves competitive results. Our source code will be released at https://github.com/5jiahe/psrn.
Pages: 13058-13070
Page count: 13