Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

Times Cited: 0
Authors
Xu, Yifan [1 ,2 ]
Zhang, Mengdan [3 ]
Yang, Xiaoshan [1 ,2 ]
Xu, Changsheng [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Univ Chinese Acad Sci, Inst Automat, MAIS, Beijing 100190, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[3] Tencent Youtu Lab, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Visualization; Detectors; Object detection; Context modeling; Proposals; Location awareness; Annotations; Vocabulary; Training; open-vocabulary; contextual knowledge;
DOI
10.1109/TIP.2024.3485518
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically placing high attention on the corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, which explicitly supervises a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, further improving OVD performance. Extensive experiments on various detection datasets demonstrate the effectiveness of our multi-modal context learning strategy.
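To make the distillation idea in the abstract concrete, the following is a minimal, hypothetical sketch in PyTorch, not the authors' released implementation. It assumes the teacher fusion transformer exposes, for each masked concept word, an attention vector over region tokens, and that the student detector produces region-concept matching scores for the same proposals; the function and argument names (contextual_attention_distillation, student_region_logits, teacher_masked_attn) and the KL-divergence form of the objective are illustrative assumptions.

import torch
import torch.nn.functional as F

def contextual_attention_distillation(student_region_logits, teacher_masked_attn, temperature=1.0):
    # student_region_logits: (num_regions,) similarity of each region proposal
    #   to the concept word, as predicted by the student detector (assumed name).
    # teacher_masked_attn: (num_regions,) attention weights the masked concept
    #   word places on the same region tokens inside the teacher (assumed name).
    # Treat the teacher attention as a soft target distribution over regions.
    target = teacher_masked_attn / teacher_masked_attn.sum().clamp_min(1e-8)
    log_pred = F.log_softmax(student_region_logits / temperature, dim=-1)
    # KL divergence pulls the student's region scores toward the regions the
    # teacher attends to when it reconstructs the masked concept word.
    return F.kl_div(log_pred, target, reduction="sum")

if __name__ == "__main__":
    torch.manual_seed(0)
    num_regions = 8
    student_scores = torch.randn(num_regions)    # student's region-concept similarities
    teacher_attention = torch.rand(num_regions)  # teacher's masked-word attention over regions
    loss = contextual_attention_distillation(student_scores, teacher_attention)
    print(f"distillation loss: {loss.item():.4f}")

In this reading, the distillation term would be added to the detector's standard training losses, so the student's concept-region matching is guided both by annotated base-class boxes and by the teacher's context-derived attention on regions, including those tied to novel-class concepts.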
Pages: 6253-6267
Number of Pages: 15
Related Papers
50 records in total
  • [21] Prompt-guided DETR with RoI-pruned masked attention for open-vocabulary object detection
    Song, Hwanjun
    Bang, Jihwan
    PATTERN RECOGNITION, 2024, 155
  • [22] MULTI-MODAL FEATURE FUSION NETWORK FOR GHOST IMAGING OBJECT DETECTION
    Hu, Nan
    Ma, Huimin
    Le, Chao
    Shao, Xuehui
    2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 351 - 355
  • [23] Multi-Modal Dataset Generation using Domain Randomization for Object Detection
    Marez, Diego
    Nans, Lena
    Borden, Samuel
    GEOSPATIAL INFORMATICS XI, 2021, 11733
  • [24] CrossFormer: Cross-guided attention for multi-modal object detection
    Lee, Seungik
    Park, Jaehyeong
    Park, Jinsun
    PATTERN RECOGNITION LETTERS, 2024, 179 : 144 - 150
  • [25] Learning Adaptive Fusion Bank for Multi-Modal Salient Object Detection
    Wang, Kunpeng
    Tu, Zhengzheng
    Li, Chenglong
    Zhang, Cheng
    Luo, Bin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 7344 - 7358
  • [26] MEDMCN: a novel multi-modal EfficientDet with multi-scale CapsNet for object detection
    Li, Xingye
    Liu, Jin
    Tang, Zhengyu
    Han, Bing
    Wu, Zhongdai
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (09) : 12863 - 12890
  • [27] Height-Adaptive Deformable Multi-Modal Fusion for 3D Object Detection
    Li, Jiahao
    Chen, Lingshan
    Li, Zhen
    IEEE ACCESS, 2025, 13 : 52385 - 52396
  • [28] Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection
    Liu, Zhanwen
    Cheng, Juanru
    Fan, Jin
    Lin, Shan
    Wang, Yang
    Zhao, Xiangmo
    IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27 : 707 - 717
  • [29] Deep learning based object detection from multi-modal sensors: an overview
    Liu, Ye
    Meng, Shiyang
    Wang, Hongzhang
    Liu, Jun
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 : 19841 - 19870
  • [30] Multi-modal object detection using unsupervised transfer learning and adaptation techniques
    Abbott, Rachael
    Robertson, Neil
    del Rincon, Jesus Martinez
    Connor, Barry
    ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING IN DEFENSE APPLICATIONS, 2019, 11169