Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

Cited by: 0
Authors
Xu, Yifan [1, 2]
Zhang, Mengdan [3]
Yang, Xiaoshan [1, 2]
Xu, Changsheng [1, 2]
Affiliations
[1] Chinese Acad Sci, Univ Chinese Acad Sci, Inst Automat, MAIS, Beijing 100190, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[3] Tencent Youtu Lab, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Visualization; Detectors; Object detection; Context modeling; Proposals; Location awareness; Annotations; Vocabulary; Training; open-vocabulary; contextual knowledge
DOI
10.1109/TIP.2024.3485518
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically activating high attention on corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to explicitly supervise a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances the fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, thereby further improving the OVD performance. Extensive experiments performed upon various detection datasets show the effectiveness of our multi-modal context learning strategy.
Pages: 6253-6267
Page count: 15