CLVIN: Complete language-vision interaction network for visual question answering

Cited by: 83
Authors
Chen, Chongqing [1 ]
Han, Dezhi [1 ]
Shen, Xiang [1 ]
Affiliations
[1] Shanghai Maritime Univ, Sch Informat Engn, Shanghai 201306, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Shanghai;
Keywords
Interactive modeling; Multimodal information; Language-vision interaction; Complete interaction; E-D mode; Attention;
DOI
10.1016/j.knosys.2023.110706
CLC Number
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The emergence of the Transformer has improved the interactive modeling of multimodal information in visual question answering (VQA), helping machines better understand multimodal inputs. Existing Transformer-based end-to-end methods have made progress either by applying the Encoder-Decoder (E-D) mode or by realizing complete interaction, but few combine the advantages of both and exploit them fully. This paper therefore designs a complete language-vision interaction network (CLVIN) for VQA built on a quadratic (repeated) E-D mode. Starting from the core framework of the modular co-attention network (MCAN), CLVIN applies the E-D mode a second time so that the two modalities interact completely and the weight information of the question words is distributed more reasonably. In addition, to offset the extra time and memory cost introduced by the quadratic E-D mode, a compact variant, CLVIN-c, is proposed by optimizing the underlying implementation of the scaled dot-product attention in the Transformer. Experimental results on the VQA-v2.0 and CLEVR datasets show that CLVIN delivers a significant performance improvement, and CLVIN-c achieves further gains in model size and accuracy. Code is available at https://github.com/RainyMoo/myvqa. © 2023 Elsevier B.V. All rights reserved.
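To make the two ingredients named in the abstract concrete, the following is a minimal PyTorch sketch of standard scaled dot-product attention and of a "second E-D pass" in which the question words attend back over the decoded visual features. This is only an illustration under stated assumptions: the class names, dimensions, and the exact wiring of the second pass are hypothetical and do not reproduce the authors' implementation (available in the linked repository).

```python
# Illustrative sketch only: standard scaled dot-product attention and a
# second Encoder-Decoder (cross-attention) pass. Names and wiring are
# assumptions, not the CLVIN implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), v)


class CrossAttentionBlock(nn.Module):
    """One decoder-style block: queries attend over a context sequence."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        attended = scaled_dot_product_attention(
            self.q_proj(x), self.k_proj(context), self.v_proj(context)
        )
        return self.norm(x + attended)  # residual connection + layer norm


# Hypothetical "quadratic E-D" flow: after the usual pass where image
# regions attend over the encoded question (as in MCAN's E-D mode), a
# second pass lets the question words attend back over the decoded
# visual features, so both modalities interact completely.
dim, n_words, n_regions = 512, 14, 100
question = torch.randn(1, n_words, dim)    # encoded question-word features
image = torch.randn(1, n_regions, dim)     # image region features

first_pass = CrossAttentionBlock(dim)      # E-D pass 1: vision attends to language
second_pass = CrossAttentionBlock(dim)     # E-D pass 2: language attends back to vision

image_decoded = first_pass(image, question)
question_refined = second_pass(question, image_decoded)
print(image_decoded.shape, question_refined.shape)
```

The compact CLVIN-c variant described in the abstract reduces the time and memory overhead of this extra pass by reworking the underlying scaled dot-product attention; the details of that optimization are in the paper and repository, not in this sketch.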
Pages: 13