PRNet: A Progressive Refinement Network for referring image segmentation

Times Cited: 0
Authors
Liu, Jing [1 ]
Jiang, Huajie [1 ]
Hu, Yongli [1 ]
Yin, Baocai [1 ]
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, 100 Pingleyuan, Beijing 100124, Peoples R China
Funding
Beijing Natural Science Foundation; National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Referring image segmentation; Position prior; Features alignment; Progressive localization; Transformer;
DOI
10.1016/j.neucom.2025.129698
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Effective feature alignment between language and image is necessary for correctly inferring the location of referred instances in the referring image segmentation (RIS) task. Previous studies usually assist target localization with external detectors, or use a coarse-grained positional prior during multimodal feature fusion to implicitly enhance modal alignment. However, these approaches are either limited by the performance of the external detector and the design of the matching algorithm, or ignore the fine-grained features in the referring expression when relying on a coarse-grained prior, which may lead to inaccurate segmentation results. In this paper, we propose a new RIS network, the Progressive Refinement Network (PRNet), which gradually improves the alignment quality between language and image from coarse to fine. The core of PRNet is the Progressive Refinement Localization Scheme (PRLS), which consists of a Coarse Positional Prior Module (CPPM) and a Refined Localization Module (RLM). The CPPM obtains rough prior positional information and the corresponding semantic features by computing a similarity matrix between the sentence and the image. The RLM fuses the visual and language modalities by densely aligning pixel features with word features, and uses the prior positional information generated by the CPPM to enhance textual semantic understanding, guiding the model to perceive the position of the referred instance more accurately. Experimental results show that the proposed PRNet performs well on three public datasets: RefCOCO, RefCOCO+, and RefCOCOg.
Pages: 12