Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

Cited by: 0

Authors:

Xu, Yifan [1,2]

Zhang, Mengdan [3]

Yang, Xiaoshan [1,2]

Xu, Changsheng [1,2]

Affiliations:

[1] Chinese Acad Sci, Univ Chinese Acad Sci, Inst Automat, MAIS, Beijing 100190, Peoples R China

[2] Peng Cheng Lab, Shenzhen 518066, Peoples R China

[3] Tencent Youtu Lab, Shanghai 200233, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2024 / Vol. 33

Funding:

National Natural Science Foundation of China;

Keywords:

Transformers; Visualization; Detectors; Object detection; Context modeling; Proposals; Location awareness; Annotations; Vocabulary; Training; open-vocabulary; contextual knowledge;

DOI:

10.1109/TIP.2024.3485518

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically activating high attention on corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to explicitly supervise a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances the fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, thereby further improving the OVD performance. Extensive experiments performed upon various detection datasets show the effectiveness of our multi-modal context learning strategy.


Pages: 6253 - 6267

Number of pages: 15
