Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

Cited by: 0
Authors
Xu, Yifan [1, 2]
Zhang, Mengdan [3]
Yang, Xiaoshan [1, 2]
Xu, Changsheng [1, 2]
Affiliations
[1] Chinese Acad Sci, Univ Chinese Acad Sci, Inst Automat, MAIS, Beijing 100190, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[3] Tencent Youtu Lab, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Visualization; Detectors; Object detection; Context modeling; Proposals; Location awareness; Annotations; Vocabulary; Training; open-vocabulary; contextual knowledge;
DOI
10.1109/TIP.2024.3485518
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically activating high attention on the corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to explicitly supervise a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances the fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance for the concept-region matching of the detector, thereby further improving OVD performance. Extensive experiments on various detection datasets demonstrate the effectiveness of our multi-modal context learning strategy.
Pages: 6253-6267
Number of pages: 15