PRNet: A Progressive Refinement Network for referring image segmentation

Cited by: 0
Authors
Liu, Jing [1 ]
Jiang, Huajie [1 ]
Hu, Yongli [1 ]
Yin, Baocai [1 ]
Affiliations
[1] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, 100 Pingleyuan, Beijing 100124, Peoples R China
Funding
Beijing Natural Science Foundation; National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Referring image segmentation; Position prior; Features alignment; Progressive localization; Transformer;
DOI
10.1016/j.neucom.2025.129698
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Effective feature alignment between language and image is necessary for correctly inferring the location of reference instances in the referring image segmentation (RIS) task. Previous studies usually resort to assisting target localization with the help of external detectors, or to using a coarse-grained positional prior during multimodal feature fusion to implicitly enhance the modal alignment capability. However, these approaches are either limited by the performance of the external detector and the design of the matching algorithm, or ignore the fine-grained features in the reference information when using the coarse-grained prior, which may lead to inaccurate segmentation results. In this paper, we propose a new RIS network, the Progressive Refinement Network (PRNet), which aims to gradually improve the alignment quality between language and image from coarse to fine. The core of PRNet is the Progressive Refinement Localization Scheme (PRLS), which consists of a Coarse Positional Prior Module (CPPM) and a Refined Localization Module (RLM). The CPPM obtains rough prior positional information and corresponding semantic features by calculating the similarity matrix between the sentence and the image. The RLM fuses information from the visual and language modalities by densely aligning pixels with word features, and utilizes the prior positional information generated by the CPPM to enhance textual semantic understanding, thus guiding the model to perceive the position of the reference instance more accurately. Experimental results show that the proposed PRNet performs well on all three public datasets: RefCOCO, RefCOCO+, and RefCOCOg.
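The CPPM's similarity-matrix step can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the feature shapes, cosine normalization, and the spatial softmax used to turn similarities into a prior heatmap are all assumptions made for the example.

```python
import numpy as np

def coarse_positional_prior(sent_feat, img_feats):
    """Toy coarse positional prior: cosine similarity between a pooled
    sentence embedding and per-pixel visual features, normalized with a
    softmax over spatial locations (hypothetical shapes and normalization)."""
    H, W, C = img_feats.shape
    pixels = img_feats.reshape(-1, C)                       # (H*W, C)
    # Cosine similarity between each pixel feature and the sentence feature.
    sim = pixels @ sent_feat / (
        np.linalg.norm(pixels, axis=1) * np.linalg.norm(sent_feat) + 1e-8
    )                                                       # (H*W,)
    # Softmax over spatial locations yields a rough prior heatmap.
    e = np.exp(sim - sim.max())
    return (e / e.sum()).reshape(H, W)

rng = np.random.default_rng(0)
prior = coarse_positional_prior(
    rng.normal(size=8),            # pooled sentence embedding, dim 8
    rng.normal(size=(4, 4, 8)),    # 4x4 visual feature map, dim 8
)
print(prior.shape)                 # (4, 4); entries sum to 1
```

In PRNet this rough heatmap would then be passed to the RLM, which refines it by densely aligning pixels with word-level features.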
Pages: 12