Unified Open-Vocabulary Dense Visual Prediction

Citations: 11
Authors
Shi, Hengcan [1]
Hayat, Munawar [2]
Cai, Jianfei [2]
Affiliations
[1] Hunan Univ, Coll Elect & Informat Engn, Changsha 410012, Peoples R China
[2] Monash Univ, Dept Data Sci & AI, Melbourne 3800, Australia
Keywords
Task analysis; Training; Decoding; Visualization; Feature extraction; Semantics; Object detection; Open-vocabulary; object detection; image segmentation
DOI
10.1109/TMM.2024.3381835
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection and semantic, instance and panoptic segmentation) has attracted increasing research attention. However, most existing approaches are task-specific, i.e., they tackle each task individually. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, training data for OV dense prediction is relatively scarce. Separate networks can only leverage task-relevant training data, while a unified approach can integrate diverse data to boost individual tasks. We address two major challenges in unified OV prediction. First, unlike unified methods for fixed-set predictions, OV networks are usually trained with multi-modal data. Therefore, we propose a multi-modal, multi-scale and multi-task (MMM) decoding mechanism to better exploit multi-modal information for OV recognition. Second, because UOVN uses data from different tasks for training, there are significant domain and task gaps. We present a UOVN training mechanism to reduce such gaps. Experiments on four datasets demonstrate the effectiveness of our UOVN.
Pages: 8704 - 8716
Number of pages: 13
Related papers
50 items in total
  • [41] Purify Then Guide: A Bi-Directional Bridge Network for Open-Vocabulary Semantic Segmentation
    Pan, Yuwen
    Sun, Rui
    Wang, Yuan
    Yang, Wenfei
    Zhang, Tianzhu
    Zhang, Yongdong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (01) : 343 - 356
  • [42] DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments
    Ma, Ji
    Dai, Hongming
    Mu, Yao
    Wu, Pengying
    Wang, Hao
    Chi, Xiaowei
    Fei, Yang
    Zhang, Shanghang
    Liu, Chang
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (09) : 7389 - 7396
  • [43] Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting
    Shin, Hyeon-Kyeong
    Han, Hyewon
    Kim, Doyeon
    Chung, Soo-Whan
    Kang, Hong-Goo
    INTERSPEECH 2022, 2022 : 1871 - 1875
  • [44] Open-Vocabulary Animal Keypoint Detection with Semantic-Feature Matching
    Zhang, Hao
    Xu, Lumin
    Lai, Shenqi
    Shao, Wenqi
    Zheng, Nanning
    Luo, Ping
    Qiao, Yu
    Zhang, Kaipeng
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (12) : 5741 - 5758
  • [45] Optimized Tokenization Process for Open-Vocabulary Code Completion: An Empirical Study
    Hussain, Yasir
    Huang, Zhiqiu
    Zhou, Yu
    Khan, Izhar Ahmed
    Khan, Nasrullah
    Abbas, Muhammad Zahid
    27TH INTERNATIONAL CONFERENCE ON EVALUATION AND ASSESSMENT IN SOFTWARE ENGINEERING, EASE 2023, 2023 : 398 - 405
  • [46] Can Identifier Splitting Improve Open-Vocabulary Language Model of Code
    Shi, Jieke
    Yang, Zhou
    He, Junda
    Xu, Bowen
    Lo, David
    2022 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION AND REENGINEERING (SANER 2022), 2022 : 1134 - 1138
  • [47] Plan, Posture and Go: Towards Open-Vocabulary Text-to-Motion Generation
    Liu, Jinpeng
    Dai, Wenxun
    Wang, Chunyu
    Cheng, Yiji
    Tang, Yansong
    Tong, Xin
    COMPUTER VISION - ECCV 2024, PT XXVII, 2025, 15085 : 445 - 463
  • [48] OV-NeRF: Open-Vocabulary Neural Radiance Fields With Vision and Language Foundation Models for 3D Semantic Understanding
    Liao, Guibiao
    Zhou, Kaichen
    Bao, Zhenyu
    Liu, Kanglin
    Li, Qing
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (12) : 12923 - 12936
  • [49] Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization
    Zhang, Zhiwang
    Xu, Dong
    Ouyang, Wanli
    Tan, Chuanqi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (09) : 3130 - 3139
  • [50] Prompt-guided DETR with RoI-pruned masked attention for open-vocabulary object detection
    Song, Hwanjun
    Bang, Jihwan
    PATTERN RECOGNITION, 2024, 155