Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

Times Cited: 0
Authors
Xu, Yifan [1 ,2 ]
Zhang, Mengdan [3 ]
Yang, Xiaoshan [1 ,2 ]
Xu, Changsheng [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Univ Chinese Acad Sci, Inst Automat, MAIS, Beijing 100190, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[3] Tencent Youtu Lab, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Visualization; Detectors; Object detection; Context modeling; Proposals; Location awareness; Annotations; Vocabulary; Training; open-vocabulary; contextual knowledge;
DOI
10.1109/TIP.2024.3485518
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically placing high attention on the corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, which explicitly supervises a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, further improving OVD performance. Extensive experiments on various detection datasets demonstrate the effectiveness of our multi-modal context learning strategy.
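To make the distillation idea in the abstract concrete, the following is a minimal, hypothetical sketch in PyTorch, not the authors' released implementation. It assumes the teacher fusion transformer exposes, for each masked concept word, an attention vector over region tokens, and that the student detector produces region-concept matching scores for the same proposals; the function and argument names (contextual_attention_distillation, student_region_logits, teacher_masked_attn) and the KL-divergence form of the objective are illustrative assumptions.

import torch
import torch.nn.functional as F

def contextual_attention_distillation(student_region_logits, teacher_masked_attn, temperature=1.0):
    # student_region_logits: (num_regions,) similarity of each region proposal
    #   to the concept word, as predicted by the student detector (assumed name).
    # teacher_masked_attn: (num_regions,) attention weights the masked concept
    #   word places on the same region tokens inside the teacher (assumed name).
    # Treat the teacher attention as a soft target distribution over regions.
    target = teacher_masked_attn / teacher_masked_attn.sum().clamp_min(1e-8)
    log_pred = F.log_softmax(student_region_logits / temperature, dim=-1)
    # KL divergence pulls the student's region scores toward the regions the
    # teacher attends to when it reconstructs the masked concept word.
    return F.kl_div(log_pred, target, reduction="sum")

if __name__ == "__main__":
    torch.manual_seed(0)
    num_regions = 8
    student_scores = torch.randn(num_regions)    # student's region-concept similarities
    teacher_attention = torch.rand(num_regions)  # teacher's masked-word attention over regions
    loss = contextual_attention_distillation(student_scores, teacher_attention)
    print(f"distillation loss: {loss.item():.4f}")

In this reading, the distillation term would be added to the detector's standard training losses, so the student's concept-region matching is guided both by annotated base-class boxes and by the teacher's context-derived attention on regions, including those tied to novel-class concepts.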
Pages: 6253-6267
Number of Pages: 15
Related Papers
50 records in total
  • [21] Prompt-guided DETR with RoI-pruned masked attention for open-vocabulary object detection
    Song, Hwanjun
    Bang, Jihwan
    PATTERN RECOGNITION, 2024, 155
  • [22] MULTI-MODAL FEATURE FUSION NETWORK FOR GHOST IMAGING OBJECT DETECTION
    Hu, Nan
    Ma, Huimin
    Le, Chao
    Shao, Xuehui
    2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 351 - 355
  • [23] Multi-Modal Dataset Generation using Domain Randomization for Object Detection
    Marez, Diego
    Nans, Lena
    Borden, Samuel
    GEOSPATIAL INFORMATICS XI, 2021, 11733
  • [24] CrossFormer: Cross-guided attention for multi-modal object detection
    Lee, Seungik
    Park, Jaehyeong
    Park, Jinsun
    PATTERN RECOGNITION LETTERS, 2024, 179 : 144 - 150
  • [25] Learning Adaptive Fusion Bank for Multi-Modal Salient Object Detection
    Wang, Kunpeng
    Tu, Zhengzheng
    Li, Chenglong
    Zhang, Cheng
    Luo, Bin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 7344 - 7358
  • [26] MEDMCN: a novel multi-modal EfficientDet with multi-scale CapsNet for object detection
    Li, Xingye
    Liu, Jin
    Tang, Zhengyu
    Han, Bing
    Wu, Zhongdai
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (09) : 12863 - 12890
  • [27] Height-Adaptive Deformable Multi-Modal Fusion for 3D Object Detection
    Li, Jiahao
    Chen, Lingshan
    Li, Zhen
    IEEE ACCESS, 2025, 13 : 52385 - 52396
  • [28] Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection
    Liu, Zhanwen
    Cheng, Juanru
    Fan, Jin
    Lin, Shan
    Wang, Yang
    Zhao, Xiangmo
    IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27 : 707 - 717
  • [29] Deep learning based object detection from multi-modal sensors: an overview
    Liu, Ye
    Meng, Shiyang
    Wang, Hongzhang
    Liu, Jun
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 : 19841 - 19870
  • [30] Multi-modal object detection using unsupervised transfer learning and adaptation techniques
    Abbott, Rachael
    Robertson, Neil
    del Rincon, Jesus Martinez
    Connor, Barry
    ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING IN DEFENSE APPLICATIONS, 2019, 11169