Unified Open-Vocabulary Dense Visual Prediction

Cited by: 11
Authors
Shi, Hengcan [1 ]
Hayat, Munawar [2 ]
Cai, Jianfei [2 ]
Affiliations
[1] Hunan Univ, Coll Elect & Informat Engn, Changsha 410012, Peoples R China
[2] Monash Univ, Dept Data Sci & AI, Melbourne 3800, Australia
Keywords
Task analysis; Training; Decoding; Visualization; Feature extraction; Semantics; Object detection; Open-vocabulary; Image segmentation
DOI
10.1109/TMM.2024.3381835
CLC number
TP [automation technology, computer technology]
Subject classification code
0812
Abstract
In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection and semantic, instance, and panoptic segmentation) has attracted increasing research attention. However, most existing approaches are task-specific, i.e., they tackle each task individually. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address these four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, training data for OV dense prediction is relatively scarce. Separate networks can only leverage task-relevant training data, whereas a unified approach can integrate diverse data to boost individual tasks. We address two major challenges in unified OV prediction. First, unlike unified methods for fixed-set prediction, OV networks are usually trained with multi-modal data. We therefore propose a multi-modal, multi-scale and multi-task (MMM) decoding mechanism to better exploit multi-modal information for OV recognition. Second, because UOVN uses data from different tasks for training, there are significant domain and task gaps. We present a UOVN training mechanism to reduce such gaps. Experiments on four datasets demonstrate the effectiveness of UOVN.
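The abstract describes the multi-modal, multi-scale and multi-task (MMM) decoding mechanism only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of how such a decoder could combine multi-scale visual features with text embeddings and feed shared queries into several task heads (boxes, mask embeddings, open-vocabulary classification). All module names, dimensions, and wiring are assumptions for illustration only, not the authors' architecture.

```python
# Illustrative sketch only: a simplified multi-modal, multi-scale, multi-task
# decoder in the spirit of the MMM decoding described in the abstract.
# All names, dimensions, and the wiring are hypothetical assumptions.
import torch
import torch.nn as nn


class MMMDecoderSketch(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_heads=8):
        super().__init__()
        # Shared object queries decoded against every scale and modality.
        self.queries = nn.Embedding(num_queries, dim)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Task-specific heads share the same decoded queries.
        self.box_head = nn.Linear(dim, 4)       # detection: box regression
        self.mask_embed = nn.Linear(dim, dim)   # segmentation: per-query mask embeddings
        self.cls_proj = nn.Linear(dim, dim)     # open-vocabulary classification projection

    def forward(self, visual_feats, text_embeds):
        # visual_feats: list of [B, C, H_i, W_i] feature maps at several scales
        # text_embeds:  [B, K, C] category-name embeddings (e.g. from a text encoder)
        B = text_embeds.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        for feat in visual_feats:
            tokens = feat.flatten(2).transpose(1, 2)  # [B, H_i*W_i, C]
            # Cross-attend queries to visual tokens, then to text embeddings.
            q = self.norm1(q + self.visual_attn(q, tokens, tokens)[0])
            q = self.norm2(q + self.text_attn(q, text_embeds, text_embeds)[0])
        boxes = self.box_head(q).sigmoid()            # [B, Q, 4]
        mask_embeds = self.mask_embed(q)              # [B, Q, C]
        # Open-vocabulary logits: similarity between decoded queries and text embeddings.
        logits = self.cls_proj(q) @ text_embeds.transpose(1, 2)  # [B, Q, K]
        return boxes, mask_embeds, logits


if __name__ == "__main__":
    dec = MMMDecoderSketch()
    feats = [torch.randn(2, 256, s, s) for s in (32, 16, 8)]  # three feature scales
    texts = torch.randn(2, 20, 256)                           # 20 candidate categories
    boxes, mask_embeds, logits = dec(feats, texts)
    print(boxes.shape, mask_embeds.shape, logits.shape)
```

Classifying queries by similarity to text embeddings (rather than a fixed classifier layer) is what allows recognition over an open vocabulary; the shared queries and heads are what make the decoder usable across detection and the segmentation tasks.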
Pages: 8704-8716
Page count: 13
Related papers (50 in total; 10 listed below)
  • [1] CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation
    Zhu, Wenqi
    Cao, Jiale
    Xie, Jin
    Yang, Shuangming
    Pang, Yanwei
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (02) : 1098 - 1110
  • [2] Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
    Xu, Yifan
    Zhang, Mengdan
    Yang, Xiaoshan
    Xu, Changsheng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 6253 - 6267
  • [3] A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
    Zhu, Chaoyang
    Chen, Long
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 8954 - 8975
  • [4] Generalization Boosted Adapter for Open-Vocabulary Segmentation
    Xu, Wenhao
    Wang, Changwei
    Feng, Xuxiang
    Xu, Rongtao
    Huang, Longzhao
    Zhang, Zherui
    Guo, Li
    Xu, Shibiao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (01) : 520 - 533
  • [5] TAG: Guidance-Free Open-Vocabulary Semantic Segmentation
    Kawano, Yasufumi
    Aoki, Yoshimitsu
    IEEE ACCESS, 2024, 12 : 88322 - 88331
  • [6] Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation
    Fang, Hao
    Wu, Peng
    Li, Yawei
    Zhang, Xinxin
    Lu, Xiankai
    COMPUTER VISION - ECCV 2024, PT LXX, 2025, 15128 : 225 - 241
  • [7] Open-Vocabulary Action Localization With Iterative Visual Prompting
    Wake, Naoki
    Kanehira, Atsushi
    Sasabuchi, Kazuhiro
    Takamatsu, Jun
    Ikeuchi, Katsushi
    IEEE ACCESS, 2025, 13 : 56908 - 56917
  • [8] Open-Vocabulary Category-Level Object Pose and Size Estimation
    Cai, Junhao
    He, Yisheng
    Yuan, Weihao
    Zhu, Siyu
    Dong, Zilong
    Bo, Liefeng
    Chen, Qifeng
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (09) : 7661 - 7668
  • [9] OV-VG: A benchmark for open-vocabulary visual grounding
    Wang, Chunlei
    Feng, Wenquan
    Li, Xiangtai
    Cheng, Guangliang
    Lyu, Shuchang
    Liu, Binghao
    Chen, Lijiang
    Zhao, Qi
    NEUROCOMPUTING, 2024, 591
  • [10] OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields With Fine-Grained Understanding
    Deng, Yinan
    Wang, Jiahui
    Zhao, Jingyu
    Dou, Jianyu
    Yang, Yi
    Yue, Yufeng
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2025, 10 (01) : 652 - 659