Incorporating Verb Semantic Information in Visual Question Answering Through Multitask Learning Paradigm

Cited by: 0
Authors
Alizadeh, Mehrdad [1 ]
Di Eugenio, Barbara [1 ]
Affiliations
[1] University of Illinois at Chicago, Department of Computer Science, Chicago, IL 60607, USA
Keywords
Visual Question Answering; verb semantics; data augmentation; deep learning; multi-task learning
DOI
10.1142/S1793351X20400085
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) concerns providing answers to natural language questions about images. Several deep neural network approaches have been proposed to model the task in an end-to-end fashion. Although the task is grounded in visual processing, the language understanding component becomes crucial when the question focuses on events described by verbs. Our hypothesis is that models should be aware of verb semantics, as expressed via semantic role labels, argument types, and/or frame elements. Unfortunately, no existing VQA dataset includes verb semantic information. Our first contribution is a new VQA dataset (imSituVQA) built by taking advantage of the imSitu annotations; the imSitu dataset consists of images manually labeled with semantic frame elements, mostly taken from FrameNet. Second, we propose a multi-task CNN-LSTM VQA model that learns to classify answers as well as semantic frame elements. Our experiments show that semantic frame element classification helps the VQA system avoid inconsistent responses and improves performance. Third, we employ an automatic semantic role labeler to annotate a subset of the VQA dataset (VQA_sub), so that the proposed multi-task CNN-LSTM VQA model can be trained on VQA_sub as well. The results show a slight improvement over the single-task CNN-LSTM model.
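The abstract describes the model only at a high level. Below is a minimal sketch of what a multi-task CNN-LSTM VQA architecture with a shared encoder and two classification heads could look like in PyTorch. The backbone choice (ResNet-18), the element-wise fusion, the vocabulary and label-set sizes, and the loss weight w are all illustrative assumptions, not the authors' reported configuration.

# Minimal multi-task CNN-LSTM VQA sketch (PyTorch). All sizes and the
# fusion scheme are illustrative assumptions, not the paper's exact model.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskVQA(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 num_answers=1000, num_frame_elements=200):
        super().__init__()
        # CNN image encoder: frozen pretrained ResNet-18 (illustrative choice);
        # dropping the final fc layer leaves pooled (B, 512, 1, 1) features.
        cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])
        for p in self.cnn.parameters():
            p.requires_grad = False
        self.img_proj = nn.Linear(512, hidden_dim)

        # LSTM question encoder over padded token IDs.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

        # Shared fused representation feeds two task-specific heads:
        # one for answers, one for semantic frame elements.
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        self.frame_head = nn.Linear(hidden_dim, num_frame_elements)

    def forward(self, image, question_ids):
        img = self.img_proj(self.cnn(image).flatten(1))   # (B, hidden_dim)
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                         # (B, hidden_dim)
        fused = img * q   # element-wise fusion, a common simple choice
        return self.answer_head(fused), self.frame_head(fused)

# Joint objective: weighted sum of the two cross-entropies. The weight w
# is a hyperparameter assumed here, not a value reported in the paper.
def multitask_loss(answer_logits, frame_logits, answer_gold, frame_gold, w=0.5):
    ce = nn.functional.cross_entropy
    return ce(answer_logits, answer_gold) + w * ce(frame_logits, frame_gold)

Given a batch of images shaped (B, 3, 224, 224) and padded question token IDs shaped (B, T), the forward pass returns two logit tensors; backpropagating the joint loss trains the shared encoders on both tasks, which is the mechanism by which the auxiliary frame-element signal can regularize the answer classifier.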
Pages: 223 - 248
Number of pages: 26