Adaptive sparse triple convolutional attention for enhanced visual question answering

Cited by: 0
Authors: Wang, Ronggui [1]; Chen, Hong [1]; Yang, Juan [1]; Xue, Lixia [1]
Affiliations: [1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Peoples R China
Keywords: Visual question answering; Transformer; Sparse attention; Convolutional attention
DOI: 10.1007/s00371-025-03812-0
CLC Number: TP31 [Computer Software]
Subject Classification Codes: 081202; 0835
Abstract:
In this paper, we propose ASTCAN, an adaptive sparse triple convolutional attention network, designed to enhance visual question answering (VQA) by introducing innovative modifications to the standard Transformer architecture. Traditional VQA models often struggle with noise interference from irrelevant regions due to their inability to dynamically filter out extraneous features. ASTCAN addresses this limitation through an adaptive threshold sparse attention mechanism, which dynamically filters irrelevant features during training, significantly improving focus and efficiency. Additionally, we introduce a triple convolutional attention module, which extends the Transformer by capturing cross-dimensional interactions between spatial and channel features, further enhancing the model's reasoning ability. Extensive experiments on benchmark datasets demonstrate that ASTCAN outperforms most existing end-to-end methods, particularly in scenarios without pre-training, highlighting its effectiveness and potential for real-world applications. The code and datasets are publicly available to facilitate reproducibility and further research.
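The two mechanisms described in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering of (a) a self-attention layer that learns a per-head sparsity threshold and renormalizes the surviving weights, and (b) a three-branch convolutional attention block in which each branch permutes the feature tensor so a different pair of dimensions interacts. All class and parameter names (AdaptiveSparseAttention, TripleConvAttention, threshold) are illustrative assumptions, not the authors' released code; the thresholding rule and branch layout are one plausible reading of the abstract, not the paper's exact formulation.

import torch
import torch.nn as nn

class AdaptiveSparseAttention(nn.Module):
    # Multi-head self-attention with a learnable per-head sparsity threshold
    # (a hypothetical rendering of the adaptive threshold idea).
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One threshold logit per head; sigmoid keeps the cutoff in (0, 1).
        self.threshold = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x):                                # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        # Zero out weights below a learned fraction of each row's maximum,
        # then renormalize, so irrelevant regions stop receiving attention.
        tau = torch.sigmoid(self.threshold).view(1, -1, 1, 1)
        cutoff = tau * attn.amax(dim=-1, keepdim=True)
        attn = torch.where(attn >= cutoff, attn, torch.zeros_like(attn))
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class TripleConvAttention(nn.Module):
    # Three-branch convolutional attention over a (B, C, H, W) feature map.
    # Each branch permutes the tensor so a different dimension pair interacts,
    # pools the leading axis (max + mean), and gates the input with a sigmoid
    # conv map; the three branch outputs are averaged.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
            for _ in range(3)
        )
        # Each permutation is its own inverse, so applying it twice
        # restores the original (B, C, H, W) layout.
        self.perms = [(0, 1, 2, 3), (0, 2, 1, 3), (0, 3, 2, 1)]

    @staticmethod
    def _pool(y):                     # (B, D, *, *) -> (B, 2, *, *)
        return torch.cat([y.amax(1, keepdim=True), y.mean(1, keepdim=True)], 1)

    def forward(self, x):
        outs = []
        for conv, p in zip(self.convs, self.perms):
            y = x.permute(*p)
            y = y * torch.sigmoid(conv(self._pool(y)))   # cross-dimensional gate
            outs.append(y.permute(*p))                   # rotate back
        return sum(outs) / 3.0

# Usage sketch: region features for the sparse attention, grid features
# for the convolutional attention (shapes are illustrative).
x = AdaptiveSparseAttention(dim=512)(torch.randn(2, 49, 512))
g = TripleConvAttention()(torch.randn(2, 512, 7, 7))

In this sketch the cutoff is a learned fraction of each attention row's peak weight, so the degree of sparsity adapts during training rather than being fixed in advance, which matches the abstract's description of dynamic filtering.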
Pages: 17
Related Papers (50 records in total)
  • [41] Sun, Bo; Yao, Zeng; Zhang, Yinghui; Yu, Lejun. Local relation network with multilevel attention for visual question answering. Journal of Visual Communication and Image Representation, 2020, 73.
  • [42] Liang, Junwei; Jiang, Lu; Cao, Liangliang; Kalantidis, Yannis; Li, Li-Jia; Hauptmann, Alexander G. Focal Visual-Text Attention for Memex Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(08): 1893-1908.
  • [43] Shen, Xiang; Han, Dezhi; Guo, Zihan; Chen, Chongqing; Hua, Jie; Luo, Gaofeng. Local self-attention in transformer for visual question answering. Applied Intelligence, 2023, 53: 16706-16723.
  • [44] Zhang, Jing; Liu, Xiaoqiang; Wang, Zhe. Latent Attention Network With Position Perception for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36(03): 5059-5069.
  • [45] Sun, Qiang; Fu, Yanwei. Stacked Self-Attention Networks for Visual Question Answering. ICMR'19: Proceedings of the 2019 ACM International Conference on Multimedia Retrieval, 2019: 207-211.
  • [46] Aishwarya, R.; Sarath, P.; Rahman, Shibil P.; Sneha, U.; Manmadhan, Sruthy. Stacked Attention based Textbook Visual Question Answering with BERT. 2022 IEEE 19th India Council International Conference (INDICON), 2022.
  • [47] Mishra, Aakansha; Anand, Ashish; Guha, Prithwijit. Multi-stage Attention based Visual Question Answering. 2020 25th International Conference on Pattern Recognition (ICPR), 2021: 9407-9414.
  • [48] Kovath, A. G.; Nayyar, A.; Sikha, O. K. Multimodal attention-driven visual question answering for Malayalam. Neural Computing and Applications, 2024, 36(24): 14691-14708.
  • [49] Bai, Yalong; Fu, Jianlong; Zhao, Tiejun; Mei, Tao. Deep Attention Neural Tensor Network for Visual Question Answering. Computer Vision - ECCV 2018, Pt XII, 2018, 11216: 21-37.
  • [50] Peng, Liang; Yang, Yang; Bin, Yi; Xie, Ning; Shen, Fumin; Ji, Yanli; Xu, Xing. Word-to-region attention network for visual question answering. Multimedia Tools and Applications, 2019, 78: 3843-3858.