YOLO-World: Real-Time Open-Vocabulary Object Detection

Citations: 127
Authors
Cheng, Tianheng [2, 3]
Song, Lin [1]
Ge, Yixiao [1, 2]
Liu, Wenyu [3]
Wang, Xinggang [3]
Shan, Ying [1, 2]
Affiliations
[1] Tencent AI Lab, Shenzhen, Guangdong, Peoples R China
[2] Tencent PCG, ARC Lab, Shenzhen, Guangdong, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch EIC, Wuhan, Hubei, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52733.2024.01599
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation. Code and models are available at: https://github.com/AILab-CVC/YOLO-World.
Pages: 16901-16911
Page count: 11