Learning neighbor-enhanced region representations and question-guided visual representations for visual question answering

Cited by: 5
Authors
Gao, Ling [1 ]
Zhang, Hongda [2 ]
Sheng, Nan [1 ]
Shi, Lida [2 ]
Xu, Hao [1 ,2 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130012, Peoples R China
[2] Jilin Univ, Sch Artificial Intelligence, Changchun 130012, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Deep learning; Feature graph; Attention mechanism; Random walk;
DOI
10.1016/j.eswa.2023.122239
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Great strides have been made in the field of visual question answering (VQA), driven by advances in deep learning across related research areas. Existing models focus on learning and fusing visual and textual features. However, it is also crucial for VQA to model the associations between image regions and to use question information to enhance key features. In this paper, we propose a method for mining and integrating neighbor-enhanced region representations and question-guided visual representations. Specifically, a region feature graph is first constructed to integrate the features of all regions and the relationships between them. Second, a random-walk-based method is presented to obtain neighbor-enhanced region representations by exploiting the topological relationships among the region nodes in the graph. A question-guided vertical and horizontal dual attention mechanism is then proposed to enhance the region representations at the region level and the feature level, respectively. Finally, the enhanced region representations and the question representation are fused adaptively to predict the answer. Extensive experiments show that our method outperforms prior state-of-the-art methods on two competitive benchmarks, VQA v1 and VQA v2.
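As a rough illustration of the random-walk step described in the abstract, the sketch below propagates each region's features over a region graph using a random walk with restart. It is a minimal sketch, not the authors' implementation: the function and variable names are hypothetical, NumPy is assumed, and the walk parameters (restart probability, number of steps) are illustrative defaults rather than values reported in the paper.

    import numpy as np

    def neighbor_enhanced_regions(features, adjacency, restart_prob=0.15, steps=10):
        """Aggregate each region's feature with its graph neighbors via a
        random walk with restart over a region feature graph.

        features  : (N, D) array of region features (e.g., detector outputs).
        adjacency : (N, N) nonnegative affinity matrix between regions.
        """
        n = len(features)
        # Row-normalize the affinities into a stochastic transition matrix.
        row_sums = adjacency.sum(axis=1, keepdims=True) + 1e-8
        transition = adjacency / row_sums
        # Iterate the walk: each node mixes visits to its neighbors with a
        # restart back to its own starting position.
        walk = np.eye(n)
        for _ in range(steps):
            walk = (1.0 - restart_prob) * walk @ transition + restart_prob * np.eye(n)
        # Weight region features by the resulting visit probabilities, so each
        # output row blends a region with its topological neighborhood.
        return walk @ features

    # Hypothetical usage: 36 regions with 2048-d features and a
    # similarity-based affinity graph (clipped to stay nonnegative).
    feats = np.random.randn(36, 2048)
    affinity = np.maximum(feats @ feats.T, 0.0)
    enhanced = neighbor_enhanced_regions(feats, affinity)   # (36, 2048)

The restart term keeps each enhanced representation anchored to its own region, so neighbor information refines rather than overwrites the original feature; how the paper balances these two effects is not specified in the abstract.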
Pages: 12