Weakly-Supervised Grounding for VQA with Dual Visual-Linguistic Interaction

Cited by: 0

Authors:

Liu, Yi [1]

Pan, Junwen [1]

Wang, Qilong [1]

Chen, Guanlin [1]

Nie, Weiguo [2]

Zhang, Yudong [2]

Gao, Qian [2]

Hu, Qinghua [1]

Zhu, Pengfei [1]

Affiliations:

[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China

[2] Baidu Inc, Beijing, Peoples R China

Source:

ARTIFICIAL INTELLIGENCE, CICAI 2023, PT I | 2024 / Vol. 14473

Keywords:

Weakly-Supervised Learning; Visual Question Answering; Visual Grounding; LANGUAGE

DOI:

10.1007/978-981-99-8850-1_13

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Visual question answering (VQA) grounding, which aims to locate the visual evidence associated with an answer while answering the question, has attracted increasing research interest. To locate the evidence, most existing methods extract attention maps in an unsupervised manner from pretrained VQA models. Because only the text-related objective is considered during training, the attention map depicts the grounding region only coarsely, resulting in poor interpretability. A straightforward way to improve grounding accuracy is to use pixel-wise masks as strong supervision. However, precise per-pixel annotation is time-consuming and labor-intensive. To address these issues, this paper presents weakly-supervised grounding for VQA, which learns an end-to-end Dual Visual-Linguistic Interaction (DaVi) network in a unified architecture with various low-cost annotations, such as click-, scribble- and box-level grounding labels. Specifically, to enable visual mask prediction, DaVi introduces a language-based visual decoder that extends the previous VQA network. Since the visual decoder is guided by weak labels, we also present a Pseudo Grounding Refinement Module (PGRM) that refines the relatively coarse predictions as an additional constraint. Extensive experiments demonstrate that our weakly supervised DaVi significantly improves grounding performance even under click-level supervision with a single annotated pixel. Scribble-level supervision achieves 92% of the performance of its fully supervised counterpart at a dramatically reduced annotation cost. More importantly, weak visual grounding usually boosts the accuracy of text answers despite using inaccurate supervision.
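The abstract does not state which loss DaVi uses, so the following is only a minimal sketch of one common way to train a mask predictor from click- or scribble-level labels: a binary cross-entropy computed over annotated pixels only. The label encoding and the names `sparse_label` and `partial_bce` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical label encoding (an assumption, not taken from the paper):
# 1 = annotated as answer evidence, 0 = annotated as background, None = unlabeled.
WeakLabel = List[List[Optional[int]]]


def sparse_label(height: int, width: int,
                 fg_points: Sequence[Tuple[int, int]]) -> WeakLabel:
    """Build a weak label from annotated (row, col) points.

    One point models a click; several points along a stroke model a scribble.
    """
    label: WeakLabel = [[None] * width for _ in range(height)]
    for row, col in fg_points:
        label[row][col] = 1
    return label


def partial_bce(pred: List[List[float]], label: WeakLabel,
                eps: float = 1e-7) -> float:
    """Binary cross-entropy averaged over annotated pixels only."""
    total, count = 0.0, 0
    for pred_row, label_row in zip(pred, label):
        for p, y in zip(pred_row, label_row):
            if y is None:  # unlabeled pixels are skipped entirely
                continue
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count if count else 0.0


# Usage: a 3x3 predicted mask supervised by a single click at the centre.
pred = [[0.1, 0.2, 0.1],
        [0.2, 0.5, 0.2],
        [0.1, 0.2, 0.1]]
click = sparse_label(3, 3, [(1, 1)])
loss = partial_bce(pred, click)
```

With foreground-only points, a loss of this kind says nothing about where the background is, which may be one reason an additional constraint such as the refinement module described in the abstract is useful; that connection is an interpretation, not a claim made by the record.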


Pages: 156-169

Page count: 14
