A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

Cited by: 3

Authors:

Zhu, Chaoyang [1]

Chen, Long [1]

Affiliations:

[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Kowloon, Hong Kong, Peoples R China

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2024 / Volume 46 / Issue 12

Keywords:

Open-vocabulary; zero-shot learning; object detection; image segmentation; future directions; OBJECT; LANGUAGE;

DOI:

10.1109/TPAMI.2024.3413013

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive manual labeling cost, the annotated categories in existing datasets are often small-scale and pre-defined, i.e., state-of-the-art fully-supervised detectors and segmentors fail to generalize beyond the closed vocabulary. To resolve this limitation, in the last few years, the community has witnessed increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS). By "open-vocabulary", we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review of recent developments in OVD and OVS. A taxonomy is first developed to organize different tasks and methodologies. We find that the permission and usage of weak supervision signals can well discriminate different methodologies, including: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, 3D and video understanding. The main design principles, key challenges, development routes, methodology strengths, and weaknesses are thoroughly analyzed.


Pages: 8954-8975

Page count: 22

