Dual-decoder transformer network for answer grounding in visual question answering

Cited by: 6
Authors
Zhu, Liangjun [1 ]
Peng, Li [1 ]
Zhou, Weinan [1 ]
Yang, Jielong [1 ]
Affiliations
[1] Jiangnan Univ, Engn Res Ctr Internet Things Appl Technol, Wuxi 214122, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Answer grounding; Dual-decoder transformer;
DOI
10.1016/j.patrec.2023.04.003
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA) has made stunning advances by exploiting Transformer architectures and large-scale visual-linguistic pretraining. State-of-the-art methods generally require large amounts of data and hardware to predict textual answers, and they fail to provide visual evidence for those answers. To mitigate these limitations, we propose a novel dual-decoder Transformer network (DDTN) for predicting the language answer and the corresponding vision instance. Specifically, the linguistic features are first embedded by a Long Short-Term Memory (LSTM) block and a Transformer encoder, which are shared between the two Transformer decoders. Then, we introduce an object detector to obtain vision region features and grid features, reducing the size and cost of DDTN. These visual features are combined with the linguistic features and fed into the two decoders respectively. Moreover, we design an instance query to guide the fused visual-linguistic features toward outputting the instance mask or bounding box. Finally, classification layers aggregate the results from the decoders and predict the answer as well as the corresponding instance coordinates. Without bells and whistles, DDTN achieves state-of-the-art performance and is even competitive with pretraining models on the VizWizGround and GQA datasets. The code is available at https://github.com/zlj63501/DDTN. (c) 2023 Published by Elsevier B.V.
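To make the pipeline described in the abstract concrete, the following PyTorch sketch wires together the named components: a shared LSTM-plus-Transformer-encoder linguistic branch, two Transformer decoders (one for the answer, one for the grounded instance), a learned instance query, and classification/regression heads. All layer sizes, the concatenation-based fusion, and the head designs are illustrative assumptions, not the authors' actual configuration; the real implementation is in the linked repository.

import torch
import torch.nn as nn

class DualDecoderSketch(nn.Module):
    """Minimal sketch of the dual-decoder idea from the abstract.

    Every dimension and the concat-based fusion below are assumptions
    made for illustration; see https://github.com/zlj63501/DDTN for
    the authors' implementation.
    """

    def __init__(self, vocab=10000, d=256, n_answers=3000, n_heads=8):
        super().__init__()
        # Linguistic branch: word embedding -> LSTM -> Transformer encoder,
        # shared by both decoders.
        self.embed = nn.Embedding(vocab, d)
        self.lstm = nn.LSTM(d, d, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Two decoders: one for the textual answer, one for the instance.
        ans_layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)
        self.answer_decoder = nn.TransformerDecoder(ans_layer, num_layers=2)
        inst_layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)
        self.instance_decoder = nn.TransformerDecoder(inst_layer, num_layers=2)
        # Learned instance query that attends over fused features.
        self.instance_query = nn.Parameter(torch.randn(1, 1, d))
        # Heads: answer classification and box regression (cx, cy, w, h).
        self.answer_head = nn.Linear(d, n_answers)
        self.box_head = nn.Linear(d, 4)

    def forward(self, question_ids, region_feats, grid_feats):
        # question_ids: (B, L) token ids; region_feats/grid_feats: (B, N, d),
        # standing in for detector region features and backbone grid features.
        q, _ = self.lstm(self.embed(question_ids))
        q = self.encoder(q)                                  # shared linguistic memory
        fused_regions = torch.cat([region_feats, q], dim=1)  # assumed fusion: concat
        fused_grid = torch.cat([grid_feats, q], dim=1)
        # Answer branch: decode over region-language features, pool, classify.
        ans = self.answer_decoder(q, fused_regions).mean(dim=1)
        # Instance branch: the instance query decodes over grid-language features.
        query = self.instance_query.expand(q.size(0), -1, -1)
        inst = self.instance_decoder(query, fused_grid)[:, 0]
        return self.answer_head(ans), self.box_head(inst).sigmoid()

# Example: batch of 2 questions (12 tokens), 36 region and 49 grid features.
model = DualDecoderSketch()
answer_logits, boxes = model(torch.randint(0, 10000, (2, 12)),
                             torch.randn(2, 36, 256),
                             torch.randn(2, 49, 256))

The one-query design mirrors DETR-style instance decoding: rather than scoring every region, a single learned query pools the fused features into one grounded instance prediction, which is what lets the network emit a mask or box alongside the textual answer.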
Pages: 53-60
Number of pages: 8