A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

Cited by: 3
Authors
Zhu, Chaoyang [1]
Chen, Long [1]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Kowloon, Hong Kong, Peoples R China
Keywords
Open-vocabulary; zero-shot learning; object detection; image segmentation; future directions
DOI
10.1109/TPAMI.2024.3413013
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive cost of manual labeling, the annotated categories in existing datasets are often small-scale and pre-defined, i.e., state-of-the-art fully-supervised detectors and segmentors fail to generalize beyond the closed vocabulary. To resolve this limitation, in the last few years the community has witnessed increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS). By "open-vocabulary", we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review of recent developments in OVD and OVS. A taxonomy is first developed to organize different tasks and methodologies. We find that the permission and usage of weak supervision signals can well discriminate different methodologies, including visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, and 3D and video understanding. The main design principles, key challenges, development routes, methodology strengths, and weaknesses are thoroughly analyzed.
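The common thread behind several of the surveyed methodologies (most directly visual-semantic space mapping) is to classify region features by their similarity to text embeddings of arbitrary category names, rather than against a fixed classifier head. The following minimal sketch illustrates only that matching step; the region and text features are random placeholders standing in for the outputs of a pretrained vision-language model such as CLIP, and the function name and temperature value are illustrative assumptions, not the survey's implementation.

import torch
import torch.nn.functional as F

def classify_regions(region_feats: torch.Tensor,
                     text_embeds: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """Assign each region to the closest free-form category name in a shared embedding space.

    region_feats: (N, D) visual features of N region proposals (hypothetical placeholder input).
    text_embeds:  (C, D) embeddings of C category names, e.g., encoded from prompts
                  like "a photo of a {category}" by a text encoder.
    Returns an (N, C) matrix of per-region probabilities over the C names.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Cosine similarity between every region and every category name, scaled by a temperature.
    logits = region_feats @ text_embeds.t() / temperature
    return logits.softmax(dim=-1)

# Toy usage: 4 region proposals, 3 arbitrary category names, 512-dimensional embedding space.
probs = classify_regions(torch.randn(4, 512), torch.randn(3, 512))
print(probs.shape)  # torch.Size([4, 3])

Because the category names enter only through their text embeddings, new classes can be added at inference time simply by embedding their names, which is what allows such models to classify objects beyond the pre-defined training vocabulary.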
Pages: 8954-8975
Number of pages: 22