Multi-Modal Dynamic Graph Transformer for Visual Grounding

Cited by: 18

Authors:

Chen, Sijia [1]

Li, Baochun [1]

Affiliations:

[1] Univ Toronto, Dept Elect & Comp Engn, Toronto, ON, Canada

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022

Keywords:

DOI:

10.1109/CVPR52688.2022.01509

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual grounding (VG) aims to align the correct regions of an image with a natural language query about that image. We found that existing VG methods are trapped by the single-stage grounding process that performs a sole evaluate-and-rank for meticulously prepared regions. Their performance depends on the density and quality of the candidate regions, and is capped by the inability to optimize the located regions continuously. To address these issues, we propose to remodel VG into a progressively optimized visual semantic alignment process. Our proposed multi-modal dynamic graph transformer (M-DGT) achieves this by building upon the dynamic graph structure with regions as nodes and their semantic relations as edges. Starting from a few randomly initialized regions, M-DGT is able to make sustainable adjustments (i.e., 2D spatial transformation and deletion) to the nodes and edges of the graph based on multi-modal information and the graph feature, thereby efficiently shrinking the graph to approach the ground truth regions. Experiments show that with an average of 48 boxes as initialization, M-DGT outperforms existing state-of-the-art methods on the Flickr30k Entities and RefCOCO datasets by a substantial margin, in terms of both accuracy and Intersection over Union (IoU) scores. Furthermore, introducing M-DGT to optimize the predicted regions of existing methods can further significantly improve their performance. The source code is available at https://github.com/iQua/M-DGT.
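The IoU score named in the abstract can be illustrated with a minimal sketch. This is not taken from the M-DGT code base: the corner box format `(x_min, y_min, x_max, y_max)`, the function names `iou` and `is_hit`, and the 0.5 threshold (a common convention in grounding evaluation, assumed here rather than stated in the abstract) are all illustrative assumptions.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Hypothetical accuracy criterion: a prediction counts as correct
    when its IoU with the ground truth reaches the assumed threshold."""
    return iou(pred, truth) >= threshold


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    ground_truth = (1.0, 1.0, 3.0, 3.0)
    print(iou(predicted, ground_truth))
    print(is_hit(predicted, ground_truth))
```

Under these assumptions, accuracy would be the fraction of queries for which `is_hit` returns true, while the IoU score reports the overlap itself.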


Pages: 15513-15522

Page count: 10
