OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network

Times Cited: 0
Authors
Zhao, Tiancheng [1 ]
Liu, Peng [2 ]
Lee, Kyusong [1 ]
Affiliations
[1] Zhejiang Univ, Binjiang Inst, Hangzhou, Zhejiang, Peoples R China
[2] Linker Technol Res, Hangzhou, Zhejiang, Peoples R China
Keywords
computer vision; object detection; object recognition
DOI
10.1049/cvi2.12268
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision. The authors introduce OmDet, a novel language-aware object detection architecture, together with an innovative training mechanism that harnesses continual learning and multi-dataset vision-language pre-training. Leveraging natural language as a universal knowledge representation, OmDet accumulates "visual vocabularies" from diverse datasets and unifies the task as a language-conditioned detection framework. The multimodal detection network (MDN) overcomes the challenges of multi-dataset joint training and generalizes to numerous training datasets without manual label-taxonomy merging. The authors demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding, achieving state-of-the-art results. Ablation studies reveal the impact of scaling the pre-training visual vocabulary, indicating a promising direction for further expansion to larger datasets. The effectiveness of the deep-fusion approach is underscored by its ability to learn jointly from multiple datasets, enhancing performance through knowledge sharing. In summary, OmDet is a novel language-aware detector designed to enhance open-vocabulary and open-world object detection through a continual-learning approach and multi-dataset vision-language pre-training. By using natural language for knowledge representation, the authors successfully increase the "visual vocabulary" and create a unified, language-conditioned detection framework that outperforms previous models on object detection and phrase grounding. This promising method proves the effectiveness of joint learning from multiple datasets and presents a path forward for scaling to even larger datasets.
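To make the idea of a language-conditioned detection framework concrete, the following is a minimal illustrative sketch, not the authors' implementation: each candidate region is scored against free-text label embeddings, so the label set (the "visual vocabulary") is just a list of strings that can grow without retraining a fixed classifier head or merging label taxonomies. The hash-based `toy_embed` function is a hypothetical stand-in for a real text/vision encoder.

```python
# Illustrative sketch (assumption, not OmDet's actual code): classify a region
# feature by cosine similarity against embeddings of free-text label phrases.
import hashlib
import math


def toy_embed(text: str, dim: int = 16) -> list:
    """Deterministic toy stand-in for a text encoder (hypothetical)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [(b / 255.0) - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two unit-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def classify_region(region_feature: list, vocabulary: list) -> str:
    """Return the vocabulary phrase whose embedding best matches the region."""
    scores = {phrase: cosine(region_feature, toy_embed(phrase))
              for phrase in vocabulary}
    return max(scores, key=scores.get)


# The "visual vocabulary" is an open list of strings; appending a new phrase
# extends the detector's label space with no manual taxonomy merging.
vocab = ["a red car", "a pedestrian", "a traffic light"]
region = toy_embed("a red car")  # pretend this came from a vision backbone
print(classify_region(region, vocab))
```

Because labels live in the same embedding space as region features, adding a dataset with new categories only requires adding phrases to `vocab`, which mirrors the paper's claim that natural language serves as a universal knowledge representation across datasets.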
Pages: 626-639
Page count: 14