Visual Question Answering Research on Joint Knowledge and Visual Information Reasoning

Cited by: 0
Authors
Su, Zhenqiang [1 ]
Gou, Gang [1 ]
Affiliations
[1] State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang
Keywords
attention mechanism; external knowledge; feature fusion; multimodal alignment; visual question answering;
DOI
10.3778/j.issn.1002-8331.2209-0456
Abstract
As a multimodal task, visual question answering requires fusing and reasoning over features from different modalities, and it has important application value. In traditional visual question answering, the answer can be inferred from the visual information in the image alone. However, pure visual information cannot meet the diverse question-answering needs of real-world scenarios. Knowledge plays an important role in visual question answering and can effectively assist answering: knowledge-based open visual question answering needs to incorporate related external knowledge to achieve cross-modal scene understanding. To better integrate visual information with external knowledge, a bilinear structure for joint knowledge and visual information reasoning is proposed, together with a dual-guided attention module in which image features and question features jointly guide the knowledge representation. Firstly, the model uses a pre-trained vision-language model to obtain the feature representations of the question and image as well as the visual reasoning information. Secondly, a similarity matrix is used to compute the image object regions semantically aligned with the question; these aligned region features and the question features then jointly guide the knowledge representation to obtain the knowledge reasoning information. Finally, the visual reasoning information and the knowledge reasoning information are fused to produce the final answer. Experimental results on the OK-VQA dataset show that the accuracy of the model is 1.97 and 4.82 percentage points higher than the two baseline methods, respectively, verifying the effectiveness of the model.
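The pipeline the abstract describes — similarity-matrix alignment of question tokens to image regions, then question- and region-guided attention over knowledge features, then fusion — can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: all function names, the mean-pooling of tokens into guide vectors, the additive guidance, and the concatenation-based fusion are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_aligned_regions(q_tokens, regions):
    """Similarity-matrix alignment: each question token attends over regions."""
    sim = q_tokens @ regions.T        # (T, R) token-region similarity matrix
    attn = softmax(sim, axis=1)       # per-token attention over regions
    return attn @ regions             # (T, D) question-aligned region features

def dual_guided_knowledge(q_vec, r_vec, knowledge):
    """Question and aligned-region vectors jointly guide knowledge attention."""
    guide = q_vec + r_vec             # joint guidance signal (assumed additive)
    w = softmax(knowledge @ guide)    # (K,) weights over knowledge items
    return w @ knowledge              # (D,) knowledge reasoning vector

# Toy dimensions: 12 question tokens, 36 regions, 20 knowledge items, dim 64.
D, T, R, K = 64, 12, 36, 20
q_tokens  = rng.standard_normal((T, D))
regions   = rng.standard_normal((R, D))
knowledge = rng.standard_normal((K, D))

aligned = question_aligned_regions(q_tokens, regions)
q_vec, r_vec = q_tokens.mean(axis=0), aligned.mean(axis=0)
k_vec = dual_guided_knowledge(q_vec, r_vec, knowledge)

visual_vec = regions.mean(axis=0)     # stand-in for the visual reasoning info
fused = np.concatenate([visual_vec * q_vec, k_vec])  # simple fusion for the answer head
```

In the actual model the guide vectors and fusion would be produced by learned bilinear layers rather than pooling and concatenation; the sketch only shows the data flow between the alignment, dual-guided attention, and fusion stages.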
Pages: 95-102
Page count: 7
References
32 in total
[1]  
ANTOL S, AGRAWAL A, LU J, et al., VQA: visual question answering, Proceedings of the IEEE International Conference on Computer Vision, pp. 2425-2433, (2015)
[2]  
MIKOLOV T, CHEN K, CORRADO G, et al., Efficient estimation of word representations in vector space, Proceedings of the 1st International Conference on Learning Representations, pp. 1-12, (2013)
[3]  
PENNINGTON J, SOCHER R, MANNING C D., GloVe: global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, (2014)
[4]  
DEVLIN J, CHANG M W, LEE K, et al., BERT: pre-training of deep bidirectional transformers for language understanding, Proceedings of NAACL-HLT, pp. 4171-4186, (2019)
[5]  
SIMONYAN K, ZISSERMAN A., Very deep convolutional networks for large-scale image recognition, Proceedings of the 3rd International Conference on Learning Representations, pp. 1-14, (2015)
[6]  
HE K, ZHANG X, REN S, et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, (2016)
[7]  
REN S, HE K, GIRSHICK R, et al., Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), pp. 1137-1149, (2017)
[8]  
MALINOWSKI M, ROHRBACH M, FRITZ M., Ask your neurons: a neural-based approach to answering questions about images, Proceedings of the IEEE International Conference on Computer Vision, pp. 1-9, (2015)
[9]  
GRAVES A., Long short-term memory, Supervised Sequence Labelling with Recurrent Neural Networks, pp. 37-45, (2012)
[10]  
REN M, KIROS R, ZEMEL R., Image question answering: a visual semantic embedding model and a new dataset, Advances in Neural Information Processing Systems, (2015)