Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

Cited by: 0
Authors
Xu, Yifan [1 ,2 ]
Zhang, Mengdan [3 ]
Yang, Xiaoshan [1 ,2 ]
Xu, Changsheng [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Univ Chinese Acad Sci, Inst Automat, MAIS, Beijing 100190, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[3] Tencent Youtu Lab, Shanghai 200233, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Transformers; Visualization; Detectors; Object detection; Context modeling; Proposals; Location awareness; Annotations; Vocabulary; Training; open-vocabulary; contextual knowledge;
DOI
10.1109/TIP.2024.3485518
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically activating high attention on corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to explicitly supervise a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances the fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, thereby further improving the OVD performance. Extensive experiments on various detection datasets show the effectiveness of our multi-modal context learning strategy.
Pages: 6253-6267
Page count: 15