Advanced Visual and Textual Co-context Aware Attention Network with Dependent Multimodal Fusion Block for Visual Question Answering

Cited by: 1
Authors
Asri H.S. [1 ]
Safabakhsh R. [1 ]
Affiliation
[1] Computer Engineering Department, Amirkabir University of Technology (Tehran Polytechnic), Tehran
Keywords
Dependent multimodal fusion block; Question-level and word-level visual attention mechanisms; Textual context-aware attention; Universal Sentence Encoder
DOI
10.1007/s11042-024-18871-z
Abstract
Visual question answering (VQA) is a multimodal task that requires simultaneous understanding of visual and textual content. Image and question comprehension, dense interaction between words and image regions, and knowledge inference are therefore at the core of VQA. In this paper, we propose the Advanced Visual and Textual Co-context Aware Attention Network with Dependent Multimodal Fusion Block for Visual Question Answering (ACOCAD), which consists of image and question representations and three proposed mechanisms: textual context-aware attention, question-level and word-level visual attention, and a dependent multimodal fusion block. The textual context-aware attention mechanism marks the keywords of the question and captures rich features by modeling a context-aware unit alongside the Universal Sentence Encoder (USE) model and a self-attention unit. The advanced visual attention approach attends to image regions at both the question level and the word level. The dependent multimodal fusion block strengthens the association between keywords and key regions and generates more effective vectors. Three sub-models are defined from the three proposed mechanisms, and an ablation study on the benchmark GQA and VQA-v2 datasets evaluates the effectiveness of each mechanism of the ACOCAD model. A second ablation study varies one of the hyper-parameters to find the value that maximizes the overall accuracy of ACOCAD. Moreover, we explore how the dependent multimodal fusion block may relieve limitations of prior methods in answering questions that contain homographs. In addition, to address the challenge posed by question length, the potential efficiency of the USE model and the visual attention mechanism is analyzed. Finally, a qualitative evaluation visualizes the effectiveness of the ACOCAD model on selected samples.
The results demonstrate that the ACOCAD model outperforms prior models on four of the seven criteria of the GQA dataset, reaching an overall accuracy of 57.37%. Furthermore, on VQA-v2 our model achieves a significant improvement over previous state-of-the-art models, reaching accuracies of 87.43% on the Yes/No question type, 71.02% overall on test-dev, and 71.18% overall on test-std. Moreover, one of the sub-models attains the best accuracy among all models, 60.95%, on the Other question type. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
Pages: 87959-87986
Page count: 27