TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer

Cited by: 26
Authors
Deng, Jiajun [1]
Yang, Zhengyuan [2]
Liu, Daqing [3]
Chen, Tianlang [4]
Zhou, Wengang [1,5]
Zhang, Yanyong [1,5]
Li, Houqiang [1,5]
Ouyang, Wanli [6]
Affiliations
[1] Univ Sci & Technol China, Hefei 230052, Anhui, Peoples R China
[2] Microsoft, Redmond, WA 98052 USA
[3] JD Explore Acad, Beijing 100101, Peoples R China
[4] Amazon, Seattle, WA 98109 USA
[5] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230031, Peoples R China
[6] Shanghai Artificial Intelligence Lab, Shanghai 200030, Peoples R China
Keywords
Transformers; Visualization; Grounding; Proposals; Pipelines; Convolutional neural networks; Cognition; Deep learning; transformer network; vision and language; visual grounding
DOI
10.1109/TPAMI.2023.3296823
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models prone to overfitting specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers while achieving higher performance. However, the core fusion Transformer in TransVG stands apart from the uni-modal encoders and must therefore be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++, which makes two improvements. First, we upgrade the framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding. Second, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at its intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art results.
Pages: 13636-13652
Page count: 17
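
To make the fusion idea summarized in the abstract concrete, the snippet below is a minimal PyTorch-style sketch, not the authors' released implementation: visual tokens, language tokens, and a learnable [REG] token are concatenated and passed through a plain stack of Transformer encoder layers, and an MLP head on the [REG] output directly regresses normalized box coordinates. All module names, dimensions, and the head design here are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the published TransVG code) of fusing
# visual and language tokens with a plain Transformer encoder stack and
# regressing the referred box from a learnable [REG] token.
import torch
import torch.nn as nn


class SimpleFusionGrounder(nn.Module):
    def __init__(self, dim=256, num_layers=6, num_heads=8):
        super().__init__()
        # Learnable [REG] token whose output embedding carries the box prediction.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # "A simple stack of Transformer encoder layers" replaces hand-crafted fusion.
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Small MLP that regresses normalized (cx, cy, w, h) box coordinates.
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_tokens, lang_tokens):
        # visual_tokens: (B, Nv, dim) from a vision backbone / ViT
        # lang_tokens:   (B, Nl, dim) from a language encoder such as BERT
        b = visual_tokens.size(0)
        reg = self.reg_token.expand(b, -1, -1)
        tokens = torch.cat([reg, visual_tokens, lang_tokens], dim=1)
        fused = self.fusion(tokens)
        # Sigmoid keeps the regressed box inside the normalized image plane.
        return self.box_head(fused[:, 0]).sigmoid()


if __name__ == "__main__":
    model = SimpleFusionGrounder()
    v = torch.randn(2, 400, 256)   # e.g. a 20x20 visual feature map, flattened
    l = torch.randn(2, 20, 256)    # e.g. 20 word/sub-word embeddings
    print(model(v, l).shape)       # torch.Size([2, 4])
```

This stand-alone fusion stack corresponds to the original TransVG design; the Language Conditioned Vision Transformer of TransVG++ instead removes such an external fusion module and injects the language tokens into the intermediate layers of the ViT itself.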