TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer

Cited by: 26
Authors
Deng, Jiajun [1 ]
Yang, Zhengyuan [2 ]
Liu, Daqing [3 ]
Chen, Tianlang [4 ]
Zhou, Wengang [1 ,5 ]
Zhang, Yanyong [1 ,5 ]
Li, Houqiang [1 ,5 ]
Ouyang, Wanli [6 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230052, Anhui, Peoples R China
[2] Microsoft, Redmond, WA 98052 USA
[3] JD Explore Acad, Beijing 100101, Peoples R China
[4] Amazon, Seattle, WA 98109 USA
[5] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230031, Peoples R China
[6] Shanghai Artificial Intelligence Lab, Shanghai 200030, Peoples R China
Keywords
Transformers; Visualization; Grounding; Proposals; Pipelines; Convolutional neural networks; Cognition; Deep learning; transformer network; vision and language; visual grounding;
DOI
10.1109/TPAMI.2023.3296823
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models prone to overfitting specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers with higher performance. However, the fusion Transformer in TransVG is separate from the uni-modal encoders and must therefore be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++, which makes two improvements. First, we upgrade the framework to a purely Transformer-based one by leveraging the Vision Transformer (ViT) for visual feature encoding. Second, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at its intermediate layers. We conduct extensive experiments on five widely used datasets and report a series of state-of-the-art results.
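To make the fusion idea in the abstract concrete, the following is a minimal, self-contained Python (PyTorch) sketch, not the authors' released implementation: it shows one way language tokens can be injected into a ViT-style encoder block so fusion happens inside the vision backbone, and how a box can be regressed directly from the fused features. All module and variable names (e.g., LanguageConditionedBlock, BoxRegressionHead) are illustrative assumptions, and the attention layout is a simplification of the paper's design.

import torch
import torch.nn as nn

class LanguageConditionedBlock(nn.Module):
    # A ViT-style encoder block whose attention keys/values also include language
    # tokens, so vision-language fusion happens inside the vision backbone (illustrative).
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis_tokens, lang_tokens):
        # Queries come from the visual tokens; keys/values are the concatenation of
        # visual and language tokens, conditioning visual features on the query text.
        q = self.norm1(vis_tokens)
        kv = torch.cat([q, lang_tokens], dim=1)
        attended, _ = self.attn(q, kv, kv)
        vis_tokens = vis_tokens + attended
        vis_tokens = vis_tokens + self.mlp(self.norm2(vis_tokens))
        return vis_tokens

class BoxRegressionHead(nn.Module):
    # Regresses the referred box directly from a pooled fused token, reflecting the
    # regression-based formulation described in the abstract (sketch only).
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, token):
        # Normalized (cx, cy, w, h) in [0, 1].
        return self.mlp(token).sigmoid()

# Toy usage: a batch of 2 samples with 196 visual patch tokens and 20 language tokens.
vis = torch.randn(2, 196, 768)
lang = torch.randn(2, 20, 768)
fused = LanguageConditionedBlock()(vis, lang)
box = BoxRegressionHead()(fused.mean(dim=1))   # pool tokens, then regress 4 coordinates
print(box.shape)                               # torch.Size([2, 4])

In this reading of the approach, conditioning the visual tokens on language inside the backbone removes the need for a separate fusion module trained from scratch, which is the motivation stated in the abstract.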
Pages: 13636-13652
Number of pages: 17