Unified Open-Vocabulary Dense Visual Prediction

Cited by: 11

Authors:

Shi, Hengcan [1]

Hayat, Munawar [2]

Cai, Jianfei [2]

Affiliations:

[1] Hunan Univ, Coll Elect & Informat Engn, Changsha 410012, Peoples R China

[2] Monash Univ, Dept Data Sci & AI, Melbourne 3800, Australia

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Keywords:

Task analysis; Training; Decoding; Visualization; Feature extraction; Semantics; Object detection; Open-vocabulary; object detection; image segmentation

DOI:

10.1109/TMM.2024.3381835

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection and semantic, instance, and panoptic segmentation) has attracted increasing research attention. However, most existing approaches are task-specific, i.e., they tackle each task individually. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, training data for OV dense prediction is relatively scarce. Separate networks can only leverage task-relevant training data, while a unified approach can integrate diverse data to boost individual tasks. We address two major challenges in unified OV prediction. First, unlike unified methods for fixed-set prediction, OV networks are usually trained with multi-modal data. Therefore, we propose a multi-modal, multi-scale, and multi-task (MMM) decoding mechanism to better exploit multi-modal information for OV recognition. Second, because UOVN is trained on data from different tasks, there are significant domain and task gaps. We present a UOVN training mechanism to reduce such gaps. Experiments on four datasets demonstrate the effectiveness of our UOVN.


Pages: 8704-8716

Number of pages: 13

