TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer

Cited by: 26
Authors
Deng, Jiajun [1]
Yang, Zhengyuan [2]
Liu, Daqing [3]
Chen, Tianlang [4]
Zhou, Wengang [1,5]
Zhang, Yanyong [1,5]
Li, Houqiang [1,5]
Ouyang, Wanli [6]
Affiliations
[1] Univ Sci & Technol China, Hefei 230052, Anhui, Peoples R China
[2] Microsoft, Redmond, WA 98052 USA
[3] JD Explore Acad, Beijing 100101, Peoples R China
[4] Amazon, Seattle, WA 98109 USA
[5] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230031, Peoples R China
[6] Shanghai Artificial Intelligence Lab, Shanghai 200030, Peoples R China
Keywords
Transformers; Visualization; Grounding; Proposals; Pipelines; Convolutional neural networks; Cognition; Deep learning; transformer network; vision and language; visual grounding
DOI
10.1109/TPAMI.2023.3296823
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models prone to overfitting specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers while achieving higher performance. However, the core fusion Transformer in TransVG stands apart from the uni-modal encoders and must therefore be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++, which makes two improvements. First, we upgrade the framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding. Second, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at its intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art results.
Pages: 13636-13652
Page count: 17
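
To make the fusion idea summarized in the abstract concrete, the snippet below is a minimal PyTorch-style sketch, not the authors' released implementation: visual tokens, language tokens, and a learnable [REG] token are concatenated and passed through a plain stack of Transformer encoder layers, and an MLP head on the [REG] output directly regresses normalized box coordinates. All module names, dimensions, and the head design here are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the published TransVG code) of fusing
# visual and language tokens with a plain Transformer encoder stack and
# regressing the referred box from a learnable [REG] token.
import torch
import torch.nn as nn


class SimpleFusionGrounder(nn.Module):
    def __init__(self, dim=256, num_layers=6, num_heads=8):
        super().__init__()
        # Learnable [REG] token whose output embedding carries the box prediction.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # "A simple stack of Transformer encoder layers" replaces hand-crafted fusion.
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Small MLP that regresses normalized (cx, cy, w, h) box coordinates.
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_tokens, lang_tokens):
        # visual_tokens: (B, Nv, dim) from a vision backbone / ViT
        # lang_tokens:   (B, Nl, dim) from a language encoder such as BERT
        b = visual_tokens.size(0)
        reg = self.reg_token.expand(b, -1, -1)
        tokens = torch.cat([reg, visual_tokens, lang_tokens], dim=1)
        fused = self.fusion(tokens)
        # Sigmoid keeps the regressed box inside the normalized image plane.
        return self.box_head(fused[:, 0]).sigmoid()


if __name__ == "__main__":
    model = SimpleFusionGrounder()
    v = torch.randn(2, 400, 256)   # e.g. a 20x20 visual feature map, flattened
    l = torch.randn(2, 20, 256)    # e.g. 20 word/sub-word embeddings
    print(model(v, l).shape)       # torch.Size([2, 4])
```

This stand-alone fusion stack corresponds to the original TransVG design; the Language Conditioned Vision Transformer of TransVG++ instead removes such an external fusion module and injects the language tokens into the intermediate layers of the ViT itself.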