Dual-decoder transformer network for answer grounding in visual question answering

Cited by: 6
Authors
Zhu, Liangjun [1 ]
Peng, Li [1 ]
Zhou, Weinan [1 ]
Yang, Jielong [1 ]
Affiliations
[1] Jiangnan Univ, Engn Res Ctr Internet Things Appl Technol, Wuxi 214122, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Answer grounding; Dual-decoder transformer;
DOI
10.1016/j.patrec.2023.04.003
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) has made stunning advances by exploiting Transformer architectures and large-scale visual-linguistic pretraining. State-of-the-art methods generally require large amounts of data and hardware to predict textual answers, and they fail to provide visual evidence for those answers. To mitigate these limitations, we propose a novel dual-decoder Transformer network (DDTN) for predicting the language answer and the corresponding vision instance. Specifically, the linguistic features are first embedded by a Long Short-Term Memory (LSTM) block and a Transformer encoder, which are shared between the two Transformer decoders. Then, we introduce an object detector to obtain vision region features and grid features, reducing the size and cost of DDTN. These visual features are combined with the linguistic features and fed into the two decoders respectively. Moreover, we design an instance query to guide the fused visual-linguistic features in outputting the instance mask or bounding box. Finally, classification layers aggregate the results from the decoders and predict the answer as well as the corresponding instance coordinates. Without bells and whistles, DDTN achieves state-of-the-art performance and is even competitive with pretraining models on the VizWizGround and GQA datasets. The code is available at https://github.com/zlj63501/DDTN. (c) 2023 Published by Elsevier B.V.
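The abstract describes a shared encoder whose output feeds two decoders: one predicting an answer class, the other using an instance query to localize the grounded region. The following is a minimal, illustrative numpy sketch of that data flow only; the layer shapes, the 100-way answer head, and the single random projection standing in for the LSTM/Transformer encoder are all simplifying assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    """Affine projection, standing in for a learned layer."""
    return x @ w + b

# Shared encoder output: fused visual-linguistic feature tokens.
# (The paper uses an LSTM block plus a Transformer encoder; modeled
# here as a single tanh projection purely for illustration.)
d = 64
tokens = rng.normal(size=(10, d))            # 10 fused feature tokens
W_enc, b_enc = rng.normal(size=(d, d)), np.zeros(d)
memory = np.tanh(linear(tokens, W_enc, b_enc))

# Decoder 1 (answer branch): pool the memory, predict answer-class
# logits over a hypothetical vocabulary of 100 answers.
W_ans, b_ans = rng.normal(size=(d, 100)), np.zeros(100)
answer_logits = linear(memory.mean(axis=0), W_ans, b_ans)

# Decoder 2 (grounding branch): a learned instance query attends over
# the shared memory, and the pooled result is regressed to a
# normalized bounding box (x, y, w, h).
query = rng.normal(size=(d,))
attn = np.exp(memory @ query)
attn /= attn.sum()                           # softmax attention weights
pooled = attn @ memory
W_box, b_box = rng.normal(size=(d, 4)), np.zeros(4)
box = 1.0 / (1.0 + np.exp(-linear(pooled, W_box, b_box)))

print(answer_logits.shape, box.shape)        # (100,) (4,)
```

The key design point mirrored here is that both branches read the same shared memory, so the answer prediction and its visual grounding are driven by one fused representation rather than two separate models.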
Pages: 53-60 (8 pages)
Related Papers
50 records in total
  • [31] A Transformer-based Medical Visual Question Answering Model
    Liu, Lei
    Su, Xiangdong
    Guo, Hui
    Zhu, Daobin
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1712 - 1718
  • [32] Local self-attention in transformer for visual question answering
    Shen, Xiang
    Han, Dezhi
    Guo, Zihan
    Chen, Chongqing
    Hua, Jie
    Luo, Gaofeng
    APPLIED INTELLIGENCE, 2023, 53 (13) : 16706 - 16723
  • [33] VISION AND TEXT TRANSFORMER FOR PREDICTING ANSWERABILITY ON VISUAL QUESTION ANSWERING
    Le, Tung
    Huy Tien Nguyen
    Minh Le Nguyen
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 934 - 938
  • [34] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering
    Zhou, Yiyi
    Ren, Tianhe
    Zhu, Chaoyang
    Sun, Xiaoshuai
    Liu, Jianzhuang
    Ding, Xinghao
    Xu, Mingliang
    Ji, Rongrong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2054 - 2064
  • [35] Transformer Module Networks for Systematic Generalization in Visual Question Answering
    Yamada, Moyuru
    D'Amario, Vanessa
    Takemoto, Kentaro
    Boix, Xavier
    Sasaki, Tomotake
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 10096 - 10105
  • [36] Text-Guided Dual-Branch Attention Network for Visual Question Answering
    Li, Mengfei
    Gu, Li
    Ji, Yi
    Liu, Chunping
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT III, 2018, 11166 : 750 - 760
  • [37] Adversarial Learning of Answer-Related Representation for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Li, Zhoujun
    CIKM'18: PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2018, : 1013 - 1022
  • [38] Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering
    Huang, Hantao
    Han, Tao
    Han, Wei
    Yap, Deep
    Chiang, Cheng-Ming
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 1173 - 1180
  • [39] Answer Them All! Toward Universal Visual Question Answering Models
    Shrestha, Robik
    Kafle, Kushal
    Kanan, Christopher
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10464 - 10473
  • [40] Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
    Whitehead, Spencer
    Petryk, Suzanne
    Shakib, Vedaad
    Gonzalez, Joseph
    Darrell, Trevor
    Rohrbach, Anna
    Rohrbach, Marcus
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 148 - 166