A Visual Grounding Method with Contrastive Learning Large Model

Cited: 0
Authors
Lu, Qing-Yang [1]
Yuan, Guang-Lin [2]
Zhu, Hong [2]
Qin, Xiao-Yan [2]
Xue, Mo-Gen [2,3]
Affiliations
[1] Graduate Brigade, PLA Army Academy of Artillery and Air Defense, Hefei 230031, Anhui, China
[2] Department of Information Engineering, PLA Army Academy of Artillery and Air Defense, Hefei 230031, Anhui, China
[3] Anhui Province Key Laboratory of Polarization Imaging Detection Technology, Hefei 230031, Anhui, China
Source
Tien Tzu Hsueh Pao/Acta Electronica Sinica | 2024, Vol. 52, No. 10
Keywords
Digital elevation model; Image enhancement; Image fusion; Semantics; Signal encoding
DOI: 10.12263/DZXB.20230364
Abstract
One-stage visual grounding methods, which predict target boxes directly from fused image and text features, have received widespread attention for their speed. However, existing methods do not align the image and text features before fusion, which limits grounding accuracy. To address this problem, this paper proposes a visual grounding method based on a contrastive learning large model. The method extracts image and text features with CLIP (Contrastive Language-Image Pre-training), a large-scale pre-trained model based on contrastive learning; fuses the image-text features with Transformer encoders; and predicts target boxes from the fused features with a multi-layer perceptron. The method overcomes the above shortcoming for two reasons: the CLIP encoders extract image-text features that are highly aligned in semantics, and global attention interactively fuses the contextual features of image and text. The proposed method was validated on five datasets, and the experimental results show that it improves overall accuracy compared with existing visual grounding methods. © 2024 Chinese Institute of Electronics. All rights reserved.
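The abstract describes a three-stage pipeline: CLIP feature extraction, Transformer-encoder fusion with global attention, and MLP box regression. The PyTorch sketch below illustrates that pipeline under stated assumptions; the module names, token shapes, dimensions, and the learnable regression-token design are illustrative guesses, not the authors' actual implementation.

import torch
import torch.nn as nn

class VisualGroundingSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        # Stand-ins for the CLIP towers: in practice the CLIP image and text
        # encoders would produce these token sequences (shapes assumed here).
        self.img_proj = nn.Linear(768, d_model)  # project CLIP image tokens
        self.txt_proj = nn.Linear(512, d_model)  # project CLIP text tokens
        # Learnable regression token; its fused state is decoded into a box
        # (a common one-stage design, assumed here).
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Global self-attention jointly over image, text, and regression tokens.
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        # MLP head regressing a normalized box (cx, cy, w, h).
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )

    def forward(self, img_tokens, txt_tokens):
        b = img_tokens.size(0)
        tokens = torch.cat([self.reg_token.expand(b, -1, -1),
                            self.img_proj(img_tokens),
                            self.txt_proj(txt_tokens)], dim=1)
        fused = self.fusion(tokens)        # interactive image-text fusion
        return self.box_head(fused[:, 0])  # box from the regression token

# Smoke test with random CLIP-shaped token sequences (ViT-B/16-like shapes).
model = VisualGroundingSketch()
box = model(torch.randn(2, 197, 768), torch.randn(2, 77, 512))
print(box.shape)  # torch.Size([2, 4])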
Pages: 3448-3458