Knowledge-Based Visual Question Generation

Cited by: 19
Authors
Xie, Jiayuan [1 ,2 ]
Fang, Wenhao [1 ,2 ]
Cai, Yi [1 ,2 ,3 ]
Huang, Qingbao [1 ,2 ,4 ,5 ,6 ]
Li, Qing [7 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[2] South China Univ Technol, Key Lab Big Data & Intelligent Robot, Guangzhou 510006, Peoples R China
[3] Pazhou Lab, Guangzhou 510335, Peoples R China
[4] Guangxi Univ, Sch Elect Engn, Nanning 530001, Peoples R China
[5] Guangxi Univ, Guangxi Key Lab Multimedia Commun & Network Techn, Nanning 530001, Peoples R China
[6] Guangxi Univ, Inst Artificial Intelligence, Nanning 530001, Peoples R China
[7] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Keywords
Visualization; Feature extraction; Task analysis; Knowledge based systems; Knowledge representation; Decoding; Image edge detection; Visual question generation; knowledge-based; multimodal
DOI
10.1109/TCSVT.2022.3189242
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline classification codes
0808; 0809
Abstract
The visual question generation task aims to generate meaningful questions about an image, targeted at a given answer. Existing methods focus on the visual concepts in the image for question generation. However, humans inevitably draw on knowledge related to the visual objects in an image when constructing questions. In this paper, we propose a knowledge-based visual question generation model that integrates visual concepts and non-visual knowledge to generate questions. To obtain visual concepts, we use a pre-trained object detection model to extract object-level features for each object in the image. To obtain useful non-visual knowledge, we first retrieve from a knowledge base the knowledge related to the visual objects in the image. Since not all retrieved knowledge is helpful for this task, we introduce an answer-aware module that captures, from the retrieved knowledge, the candidate knowledge related to the answer, which ensures that the generated content is targeted at the answer. Finally, object-level representations containing visual concepts and non-visual knowledge are fed to a decoder module to generate questions. Extensive experiments on the FVQA and KBVQA datasets show that the proposed model outperforms state-of-the-art models.
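The answer-aware selection of retrieved knowledge described in the abstract can be pictured as an attention step: each retrieved fact is scored against the answer embedding, and the scores weight the facts into a single answer-conditioned knowledge vector. The following is a minimal sketch of that idea, not the authors' implementation; the function name, embedding shapes, and toy vectors are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def answer_aware_attention(fact_embs, answer_emb):
    """Weight retrieved knowledge facts by their relevance to the answer.

    fact_embs:  (num_facts, dim) embeddings of facts retrieved from a
                knowledge base for the detected objects (illustrative).
    answer_emb: (dim,) embedding of the target answer.
    Returns the attention weights and the answer-conditioned knowledge vector.
    """
    # Scaled dot-product relevance of each fact to the answer.
    scores = fact_embs @ answer_emb / np.sqrt(fact_embs.shape[1])
    weights = softmax(scores)
    # Convex combination of facts: irrelevant facts are down-weighted.
    knowledge_vec = weights @ fact_embs
    return weights, knowledge_vec

# Toy example: three retrieved facts; the second aligns with the answer.
facts = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
answer = np.array([0.1, 0.9, 0.0, 0.0])
w, k = answer_aware_attention(facts, answer)
```

In the paper's full pipeline, a vector like `k` would be fused with the object-level visual features before decoding; here it simply shows how answer relevance filters the retrieved knowledge.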
Pages: 7547-7558
Page count: 12