Open-Vocabulary Instance Segmentation-Boundary IS-Goal

Times cited: 0
Authors
Tang, Quan [1]
Affiliations
[1] Wuhan Univ Technol, Wuhan, Peoples R China
Source
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT IV | 2025 / Vol. 15034
Keywords
Instance segmentation; Boundary detection; Open-vocabulary; Multi-modal fusion;
DOI
10.1007/978-981-97-8505-6_30
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Accurate delineation of boundaries and instance semantics is crucial for tasks such as object localization in robotic arm grasping and vehicle and pedestrian detection in autonomous driving. While research often focuses on improving instance segmentation accuracy and building lightweight models, the importance of boundary detection and open-vocabulary capabilities for human-level perception is often overlooked. In this work, we propose a lightweight visual-language dual-task framework, IS-Goal, that simultaneously performs instance segmentation and boundary detection in an open-vocabulary setting. It includes a prompt text encoder, a two-stream image encoder, and a visual-language adaptive weight decoder (VL-AWD) for multi-level cross-modal feature fusion. The text encoder extracts text embeddings, the two-stream image encoder captures instance and boundary features, and the VL-AWD module learns channel relationships to obtain adaptive weight allocation for instance features and instance boundary features, enabling multi-modal fusion. Additionally, we introduce a regularization loss to mitigate the conflicts in dual-task learning and diverse deep supervision. Compared to existing methods, IS-Goal improves instance segmentation and boundary detection performance in the open-vocabulary setting. We first validate IS-Goal's effectiveness in open-vocabulary instance segmentation on the MS COCO dataset, where it must identify new categories and distinguish them from base categories. Subsequently, on the LVIS dataset, IS-Goal surpasses existing dual-task methods with a boundary AP of 27.5%, instance segmentation AP of 37.3%, and ODS/OIS scores of 67.7/68.2. Zero-shot performance on PASCAL VOC2012 is demonstrated with an inference speed of 15.7 FPS on an RTX 2080 Ti GPU with 500 × 500 input resolution.
Pages: 420-435
Page count: 16