TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer

Cited by: 26
Authors
Deng, Jiajun [1 ]
Yang, Zhengyuan [2 ]
Liu, Daqing [3 ]
Chen, Tianlang [4 ]
Zhou, Wengang [1 ,5 ]
Zhang, Yanyong [1 ,5 ]
Li, Houqiang [1 ,5 ]
Ouyang, Wanli [6 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230052, Anhui, Peoples R China
[2] Microsoft, Redmond, WA 98052 USA
[3] JD Explore Acad, Beijing 100101, Peoples R China
[4] Amazon, Seattle, WA 98109 USA
[5] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230031, Peoples R China
[6] Shanghai Artificial Intelligence Lab, Shanghai 200030, Peoples R China
Keywords
Transformers; Visualization; Grounding; Proposals; Pipelines; Convolutional neural networks; Cognition; Deep learning; transformer network; vision and language; visual grounding;
DOI
10.1109/TPAMI.2023.3296823
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models prone to overfitting specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers with higher performance. However, the fusion Transformer in TransVG is separate from the uni-modal encoders and must therefore be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++, which makes two improvements. First, we upgrade the framework to a purely Transformer-based one by leveraging the Vision Transformer (ViT) for visual feature encoding. Second, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at its intermediate layers. We conduct extensive experiments on five widely used datasets and report a series of state-of-the-art results.
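To make the fusion idea in the abstract concrete, the following is a minimal, self-contained Python (PyTorch) sketch, not the authors' released implementation: it shows one way language tokens can be injected into a ViT-style encoder block so fusion happens inside the vision backbone, and how a box can be regressed directly from the fused features. All module and variable names (e.g., LanguageConditionedBlock, BoxRegressionHead) are illustrative assumptions, and the attention layout is a simplification of the paper's design.

import torch
import torch.nn as nn

class LanguageConditionedBlock(nn.Module):
    # A ViT-style encoder block whose attention keys/values also include language
    # tokens, so vision-language fusion happens inside the vision backbone (illustrative).
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis_tokens, lang_tokens):
        # Queries come from the visual tokens; keys/values are the concatenation of
        # visual and language tokens, conditioning visual features on the query text.
        q = self.norm1(vis_tokens)
        kv = torch.cat([q, lang_tokens], dim=1)
        attended, _ = self.attn(q, kv, kv)
        vis_tokens = vis_tokens + attended
        vis_tokens = vis_tokens + self.mlp(self.norm2(vis_tokens))
        return vis_tokens

class BoxRegressionHead(nn.Module):
    # Regresses the referred box directly from a pooled fused token, reflecting the
    # regression-based formulation described in the abstract (sketch only).
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, token):
        # Normalized (cx, cy, w, h) in [0, 1].
        return self.mlp(token).sigmoid()

# Toy usage: a batch of 2 samples with 196 visual patch tokens and 20 language tokens.
vis = torch.randn(2, 196, 768)
lang = torch.randn(2, 20, 768)
fused = LanguageConditionedBlock()(vis, lang)
box = BoxRegressionHead()(fused.mean(dim=1))   # pool tokens, then regress 4 coordinates
print(box.shape)                               # torch.Size([2, 4])

In this reading of the approach, conditioning the visual tokens on language inside the backbone removes the need for a separate fusion module trained from scratch, which is the motivation stated in the abstract.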
Pages: 13636-13652
Number of pages: 17