CLVIN: Complete language-vision interaction network for visual question answering

Cited by: 83
Authors
Chen, Chongqing [1 ]
Han, Dezhi [1 ]
Shen, Xiang [1 ]
Affiliations
[1] Shanghai Maritime Univ, Sch Informat Engn, Shanghai 201306, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Shanghai;
Keywords
Interactive modeling; Multimodal information; Language-vision interaction; Complete interaction; E-D mode; Attention;
DOI
10.1016/j.knosys.2023.110706
CLC Number
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The emergence of the Transformer has improved the interactive modeling of multimodal information in visual question answering (VQA), helping machines better understand multimodal inputs. Existing Transformer-based end-to-end methods have made progress either by applying the Encoder-Decoder (E-D) mode or by realizing complete interaction, but few combine the advantages of both and exploit them fully. This paper therefore designs a complete language-vision interaction network (CLVIN) for VQA built on a quadratic (repeated) E-D mode. Starting from the core framework of the modular co-attention network (MCAN), CLVIN applies the E-D mode a second time so that the two modalities interact completely and the weight information of the question words is distributed more reasonably. In addition, to offset the extra time and memory cost introduced by the quadratic E-D mode, a compact variant, CLVIN-c, is proposed by optimizing the underlying implementation of the scaled dot-product attention in the Transformer. Experimental results on the VQA-v2.0 and CLEVR datasets show that CLVIN delivers a significant performance improvement, and CLVIN-c achieves further gains in model size and accuracy. Code is available at https://github.com/RainyMoo/myvqa. © 2023 Elsevier B.V. All rights reserved.
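To make the two ingredients named in the abstract concrete, the following is a minimal PyTorch sketch of standard scaled dot-product attention and of a "second E-D pass" in which the question words attend back over the decoded visual features. This is only an illustration under stated assumptions: the class names, dimensions, and the exact wiring of the second pass are hypothetical and do not reproduce the authors' implementation (available in the linked repository).

```python
# Illustrative sketch only: standard scaled dot-product attention and a
# second Encoder-Decoder (cross-attention) pass. Names and wiring are
# assumptions, not the CLVIN implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), v)


class CrossAttentionBlock(nn.Module):
    """One decoder-style block: queries attend over a context sequence."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        attended = scaled_dot_product_attention(
            self.q_proj(x), self.k_proj(context), self.v_proj(context)
        )
        return self.norm(x + attended)  # residual connection + layer norm


# Hypothetical "quadratic E-D" flow: after the usual pass where image
# regions attend over the encoded question (as in MCAN's E-D mode), a
# second pass lets the question words attend back over the decoded
# visual features, so both modalities interact completely.
dim, n_words, n_regions = 512, 14, 100
question = torch.randn(1, n_words, dim)    # encoded question-word features
image = torch.randn(1, n_regions, dim)     # image region features

first_pass = CrossAttentionBlock(dim)      # E-D pass 1: vision attends to language
second_pass = CrossAttentionBlock(dim)     # E-D pass 2: language attends back to vision

image_decoded = first_pass(image, question)
question_refined = second_pass(question, image_decoded)
print(image_decoded.shape, question_refined.shape)
```

The compact CLVIN-c variant described in the abstract reduces the time and memory overhead of this extra pass by reworking the underlying scaled dot-product attention; the details of that optimization are in the paper and repository, not in this sketch.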
Pages: 13