Adaptive sparse triple convolutional attention for enhanced visual question answering

Cited by: 0
Authors
Wang, Ronggui [1 ]
Chen, Hong [1 ]
Yang, Juan [1 ]
Xue, Lixia [1 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Peoples R China
Keywords
Visual question answering; Transformer; Sparse attention; Convolutional attention;
DOI
10.1007/s00371-025-03812-0
Chinese Library Classification
TP31 [Computer Software];
Discipline Codes
081202 ; 0835 ;
Abstract
In this paper, we propose ASTCAN, an adaptive sparse triple convolutional attention network, designed to enhance visual question answering (VQA) by introducing innovative modifications to the standard Transformer architecture. Traditional VQA models often struggle with noise interference from irrelevant regions due to their inability to dynamically filter out extraneous features. ASTCAN addresses this limitation through an adaptive threshold sparse attention mechanism, which dynamically filters irrelevant features during training, significantly improving focus and efficiency. Additionally, we introduce a triple convolutional attention module, which extends the Transformer by capturing cross-dimensional interactions between spatial and channel features, further enhancing the model's reasoning ability. Extensive experiments on benchmark datasets demonstrate that ASTCAN outperforms most existing end-to-end methods, particularly in scenarios without pre-training, highlighting its effectiveness and potential for real-world applications. The code and datasets are publicly available to facilitate reproducibility and further research.
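The abstract's core idea of adaptive threshold sparse attention can be illustrated with a minimal sketch: compute standard scaled dot-product attention weights, then zero out weights that fall below a per-query threshold and renormalize the survivors. This is an illustrative NumPy reconstruction, not the paper's implementation; the mean-based threshold (`alpha * row mean`) is an assumption standing in for the learned adaptive threshold ASTCAN trains end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_sparse_attention(Q, K, V, alpha=1.0):
    """Scaled dot-product attention with a per-query sparsity threshold.

    Weights below alpha * (row mean) are zeroed, so positions the model
    deems irrelevant contribute nothing; the remaining weights are
    renormalized to sum to one. In ASTCAN the threshold is adaptive and
    learned; here it is a fixed mean-based heuristic for illustration.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (num_queries, num_keys)
    w = softmax(scores, axis=-1)
    thresh = alpha * w.mean(axis=-1, keepdims=True)
    w = np.where(w >= thresh, w, 0.0)        # filter low-relevance positions
    w = w / w.sum(axis=-1, keepdims=True)    # renormalize surviving weights
    return w @ V, w

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out, w = adaptive_sparse_attention(Q, K, V)
```

Because the per-query maximum weight is never below the row mean, at least one key always survives the filter, so the renormalization is well defined.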
Pages: 17