Unified Open-Vocabulary Dense Visual Prediction

Cited by: 11
Authors
Shi, Hengcan [1 ]
Hayat, Munawar [2 ]
Cai, Jianfei [2 ]
Affiliations
[1] Hunan Univ, Coll Elect & Informat Engn, Changsha 410012, Peoples R China
[2] Monash Univ, Dept Data Sci & AI, Melbourne 3800, Australia
Keywords
Task analysis; Training; Decoding; Visualization; Feature extraction; Semantics; Object detection; Open-vocabulary; Image segmentation
DOI
10.1109/TMM.2024.3381835
CLC number
TP [automation technology, computer technology]
Subject classification code
0812
Abstract
In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection and semantic, instance, and panoptic segmentation) has attracted increasing research attention. However, most existing approaches are task-specific, i.e., they tackle each task individually. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address these four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, training data for OV dense prediction is relatively scarce. Separate networks can only leverage task-relevant training data, whereas a unified approach can integrate diverse data to boost individual tasks. We address two major challenges in unified OV prediction. First, unlike unified methods for fixed-set prediction, OV networks are usually trained with multi-modal data. We therefore propose a multi-modal, multi-scale and multi-task (MMM) decoding mechanism to better exploit multi-modal information for OV recognition. Second, because UOVN uses data from different tasks for training, there are significant domain and task gaps. We present a UOVN training mechanism to reduce such gaps. Experiments on four datasets demonstrate the effectiveness of UOVN.
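The abstract describes the multi-modal, multi-scale and multi-task (MMM) decoding mechanism only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of how such a decoder could combine multi-scale visual features with text embeddings and feed shared queries into several task heads (boxes, mask embeddings, open-vocabulary classification). All module names, dimensions, and wiring are assumptions for illustration only, not the authors' architecture.

```python
# Illustrative sketch only: a simplified multi-modal, multi-scale, multi-task
# decoder in the spirit of the MMM decoding described in the abstract.
# All names, dimensions, and the wiring are hypothetical assumptions.
import torch
import torch.nn as nn


class MMMDecoderSketch(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_heads=8):
        super().__init__()
        # Shared object queries decoded against every scale and modality.
        self.queries = nn.Embedding(num_queries, dim)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Task-specific heads share the same decoded queries.
        self.box_head = nn.Linear(dim, 4)       # detection: box regression
        self.mask_embed = nn.Linear(dim, dim)   # segmentation: per-query mask embeddings
        self.cls_proj = nn.Linear(dim, dim)     # open-vocabulary classification projection

    def forward(self, visual_feats, text_embeds):
        # visual_feats: list of [B, C, H_i, W_i] feature maps at several scales
        # text_embeds:  [B, K, C] category-name embeddings (e.g. from a text encoder)
        B = text_embeds.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        for feat in visual_feats:
            tokens = feat.flatten(2).transpose(1, 2)  # [B, H_i*W_i, C]
            # Cross-attend queries to visual tokens, then to text embeddings.
            q = self.norm1(q + self.visual_attn(q, tokens, tokens)[0])
            q = self.norm2(q + self.text_attn(q, text_embeds, text_embeds)[0])
        boxes = self.box_head(q).sigmoid()            # [B, Q, 4]
        mask_embeds = self.mask_embed(q)              # [B, Q, C]
        # Open-vocabulary logits: similarity between decoded queries and text embeddings.
        logits = self.cls_proj(q) @ text_embeds.transpose(1, 2)  # [B, Q, K]
        return boxes, mask_embeds, logits


if __name__ == "__main__":
    dec = MMMDecoderSketch()
    feats = [torch.randn(2, 256, s, s) for s in (32, 16, 8)]  # three feature scales
    texts = torch.randn(2, 20, 256)                           # 20 candidate categories
    boxes, mask_embeds, logits = dec(feats, texts)
    print(boxes.shape, mask_embeds.shape, logits.shape)
```

Classifying queries by similarity to text embeddings (rather than a fixed classifier layer) is what allows recognition over an open vocabulary; the shared queries and heads are what make the decoder usable across detection and the segmentation tasks.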
Pages: 8704-8716
Page count: 13
Related papers (50 in total; 10 listed below)
  • [1] CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation
    Zhu, Wenqi
    Cao, Jiale
    Xie, Jin
    Yang, Shuangming
    Pang, Yanwei
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (02) : 1098 - 1110
  • [2] Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
    Xu, Yifan
    Zhang, Mengdan
    Yang, Xiaoshan
    Xu, Changsheng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 6253 - 6267
  • [3] A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
    Zhu, Chaoyang
    Chen, Long
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 8954 - 8975
  • [4] Generalization Boosted Adapter for Open-Vocabulary Segmentation
    Xu, Wenhao
    Wang, Changwei
    Feng, Xuxiang
    Xu, Rongtao
    Huang, Longzhao
    Zhang, Zherui
    Guo, Li
    Xu, Shibiao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (01) : 520 - 533
  • [5] TAG: Guidance-Free Open-Vocabulary Semantic Segmentation
    Kawano, Yasufumi
    Aoki, Yoshimitsu
    IEEE ACCESS, 2024, 12 : 88322 - 88331
  • [6] Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation
    Fang, Hao
    Wu, Peng
    Li, Yawei
    Zhang, Xinxin
    Lu, Xiankai
    COMPUTER VISION - ECCV 2024, PT LXX, 2025, 15128 : 225 - 241
  • [7] Open-Vocabulary Action Localization With Iterative Visual Prompting
    Wake, Naoki
    Kanehira, Atsushi
    Sasabuchi, Kazuhiro
    Takamatsu, Jun
    Ikeuchi, Katsushi
    IEEE ACCESS, 2025, 13 : 56908 - 56917
  • [8] Open-Vocabulary Category-Level Object Pose and Size Estimation
    Cai, Junhao
    He, Yisheng
    Yuan, Weihao
    Zhu, Siyu
    Dong, Zilong
    Bo, Liefeng
    Chen, Qifeng
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (09) : 7661 - 7668
  • [9] OV-VG: A benchmark for open-vocabulary visual grounding
    Wang, Chunlei
    Feng, Wenquan
    Li, Xiangtai
    Cheng, Guangliang
    Lyu, Shuchang
    Liu, Binghao
    Chen, Lijiang
    Zhao, Qi
    NEUROCOMPUTING, 2024, 591
  • [10] OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields With Fine-Grained Understanding
    Deng, Yinan
    Wang, Jiahui
    Zhao, Jingyu
    Dou, Jianyu
    Yang, Yi
    Yue, Yufeng
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2025, 10 (01) : 652 - 659