Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

Cited by: 0

Authors:

Xu, Yifan [1,2]

Zhang, Mengdan [3]

Yang, Xiaoshan [1,2]

Xu, Changsheng [1,2]

Affiliations:

[1] Chinese Acad Sci, Univ Chinese Acad Sci, Inst Automat, MAIS, Beijing 100190, Peoples R China

[2] Peng Cheng Lab, Shenzhen 518066, Peoples R China

[3] Tencent Youtu Lab, Shanghai 200233, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2024 / Vol. 33

Funding:

National Natural Science Foundation of China;

Keywords:

Transformers; Visualization; Detectors; Object detection; Context modeling; Proposals; Location awareness; Annotations; Vocabulary; Training; open-vocabulary; contextual knowledge;

DOI:

10.1109/TIP.2024.3485518

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically activating high attention on corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to explicitly supervise a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances the fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, thereby further improving the OVD performance. Extensive experiments performed upon various detection datasets show the effectiveness of our multi-modal context learning strategy.

Pages: 6253-6267

Page count: 15
