Weakly Supervised Grounding for VQA in Vision-Language Transformers

Cited by: 6
Authors
Khan, Aisha Urooj [1]
Kuehne, Hilde [2, 3]
Gan, Chuang [3]
Lobo, Niels Da Vitoria [1]
Shah, Mubarak [1]
Affiliations
[1] Univ Cent Florida, Orlando, FL 32816 USA
[2] Goethe Univ Frankfurt, Frankfurt, Hesse, Germany
[3] MIT, IBM Watson AI Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Source
COMPUTER VISION - ECCV 2022, PT XXXV | 2022 / Vol. 13695
Keywords
Multimodal understanding; Visual grounding; Visual question answering; Vision and language;
DOI
10.1007/978-3-031-19833-5_38
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformers for visual-language representation learning have attracted considerable interest and shown tremendous performance on visual question answering (VQA) and grounding. However, most systems that perform well on those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. Our approach leverages capsules by transforming each visual token into a capsule representation in the visual encoder; it then uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA and VQA-HAT datasets for VQA grounding. Our experiments show that, while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and provides new state-of-the-art results compared to other approaches in the field. (Code is available at https://github.com/aurooj/WSG-VQA-VLTransformers)
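The masking step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows the general idea of turning visual tokens into capsules and scaling them with a text-derived mask. All names, shapes, and weight matrices here (`tokens_to_capsules`, `text_guided_mask`, `w_pose`, `w_act`, `w_select`, the pooled text feature) are hypothetical, and the use of a single pooled text vector with a softmax over capsule types is an assumption made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tokens_to_capsules(visual_tokens, w_pose, w_act):
    """Map each visual token to num_caps capsules (hypothetical layout).

    visual_tokens: (n_tokens, d_model)
    w_pose:        (d_model, num_caps * pose_dim)
    w_act:         (d_model, num_caps)
    Returns poses (n_tokens, num_caps, pose_dim) and
    activations (n_tokens, num_caps) in (0, 1).
    """
    n_tokens = visual_tokens.shape[0]
    num_caps = w_act.shape[1]
    poses = (visual_tokens @ w_pose).reshape(n_tokens, num_caps, -1)
    acts = 1.0 / (1.0 + np.exp(-(visual_tokens @ w_act)))  # sigmoid
    return poses, acts

def text_guided_mask(text_feature, w_select):
    """Assumed selection module: one weight per capsule type from text.

    text_feature: (d_text,) pooled language feature (an assumption here)
    w_select:     (d_text, num_caps)
    """
    return softmax(text_feature @ w_select)

def mask_capsules(poses, acts, mask):
    # Scale every token's capsules by the text-derived weight of its type.
    return poses * mask[None, :, None], acts * mask[None, :]

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
N_TOKENS, D_MODEL, NUM_CAPS, POSE_DIM, D_TEXT = 6, 16, 4, 8, 12

visual_tokens = rng.normal(size=(N_TOKENS, D_MODEL))
text_feature = rng.normal(size=(D_TEXT,))
w_pose = rng.normal(size=(D_MODEL, NUM_CAPS * POSE_DIM))
w_act = rng.normal(size=(D_MODEL, NUM_CAPS))
w_select = rng.normal(size=(D_TEXT, NUM_CAPS))

poses, acts = tokens_to_capsules(visual_tokens, w_pose, w_act)
mask = text_guided_mask(text_feature, w_select)
masked_poses, masked_acts = mask_capsules(poses, acts, mask)

# Flatten back to one vector per token before the next (hypothetical) layer.
next_layer_input = masked_poses.reshape(N_TOKENS, NUM_CAPS * POSE_DIM)
```

In this sketch the mask is soft (a softmax over capsule types), so capsules the text deems irrelevant are attenuated rather than removed; whether the original method uses soft or hard masking is not stated in the abstract.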
Pages: 652-670
Number of pages: 19