Multitask Learning for Visual Question Answering

Cited by: 27
Authors
Ma, Jie [1,2]
Liu, Jun [1,2]
Lin, Qika [1,2]
Wu, Bei [1,2]
Wang, Yaxian [1,2]
You, Yang [3]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Natl Engn Lab Big Data Analyt, Xian 710049, Peoples R China
[2] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Shaanxi Prov Key Lab Satellite & Terr Network Tec, Xian 710049, Peoples R China
[3] Natl Univ Singapore, Dept Comp Sci, High Performance Comp Artificial Intelligence Lab, Singapore 117417, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Grounding; Visualization; Integrated circuit modeling; Knowledge discovery; Training; Data models; Information fusion; multimodality fusion; multitask learning; visual question answering (VQA);
DOI
10.1109/TNNLS.2021.3105284
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Visual question answering (VQA) is a task in which a machine must provide an accurate natural language answer given an image and a question about that image. Many studies have found that current VQA methods are heavily driven by surface correlations or statistical biases in the training data and lack sufficient image grounding. To address this issue, we devise a novel end-to-end architecture that uses multitask learning to promote stronger image grounding and learn effective multimodality representations. The tasks consist of VQA and our proposed image cloze (IC) task, which requires a machine to accurately fill in blanks given an image and a textual description of the image. To push the model toward grounding on the image as much as possible, we propose a novel word-masking algorithm that constructs the multimodal IC task based on the part-of-speech of words. Our model predicts the VQA answer and fills in the blanks after multimodality representation learning that is shared by the two tasks. Experimental results show that our model achieves nearly equivalent, state-of-the-art, and second-best performance on the VQA v2.0, VQA-changing priors (CP) v2, and grounded question answering (GQA) datasets, respectively, with fewer parameters and without additional data compared with the baselines.
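The record does not include the paper's implementation, but the abstract's central idea, masking caption words by their part-of-speech to build the image cloze task, can be illustrated concretely. Below is a minimal sketch assuming content words (nouns, adjectives, verbs) are the maskable candidates and a fixed mask rate; the tag set, mask rate, and MASK_TOKEN are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch of a part-of-speech-based word-masking step for an image cloze
# (IC) task. Which tags are masked and at what rate are assumptions here.
import random
import nltk  # requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

MASK_TOKEN = "[MASK]"
# Penn Treebank tags for content words that plausibly require image grounding.
MASKABLE_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS",
                 "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def pos_based_mask(caption: str, mask_rate: float = 0.3, seed: int = 0):
    """Mask a fraction of the content words in an image caption.

    Returns the masked token sequence and a map from masked positions to
    the original words, which serve as the cloze targets.
    """
    rng = random.Random(seed)
    tokens = nltk.word_tokenize(caption)
    tagged = nltk.pos_tag(tokens)
    candidates = [i for i, (_, tag) in enumerate(tagged) if tag in MASKABLE_TAGS]
    n_mask = max(1, int(len(candidates) * mask_rate)) if candidates else 0
    masked_idx = set(rng.sample(candidates, n_mask)) if n_mask else set()
    masked = [MASK_TOKEN if i in masked_idx else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked_idx)}
    return masked, targets

# Example: mask a caption paired with an image; the model must recover the
# masked words from the image plus the remaining text.
masked, targets = pos_based_mask("A brown dog is chasing a red frisbee on the grass")
print(masked)   # e.g. ['A', '[MASK]', 'dog', 'is', ...]
print(targets)  # masked position -> ground-truth word
```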
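The shared-representation multitask setup the abstract describes, one fusion backbone feeding two heads (VQA answer prediction and cloze filling), can likewise be sketched. The linear fusion module, feature dimensions, answer-vocabulary sizes, and equally weighted loss below are placeholder assumptions; the paper's actual fusion architecture and loss weighting are not specified in this record.

```python
# Sketch of a two-head multitask model sharing one multimodal fusion
# backbone between the VQA and image cloze (IC) tasks.
import torch
import torch.nn as nn

class MultitaskVQA(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=1024,
                 n_answers=3129, vocab_size=30522):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Placeholder fusion: concatenate projected features, then a small MLP.
        self.fusion = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.vqa_head = nn.Linear(hidden, n_answers)   # answer classification
        self.ic_head = nn.Linear(hidden, vocab_size)   # cloze word prediction

    def forward(self, img_feat, txt_feat):
        fused = self.fusion(torch.cat(
            [self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1))
        return self.vqa_head(fused), self.ic_head(fused)

# Joint training step: sum of the two task losses over the shared representation.
model = MultitaskVQA()
img = torch.randn(4, 2048)   # pooled image features (e.g. from a region detector)
txt = torch.randn(4, 768)    # pooled text features (question or masked caption)
vqa_logits, ic_logits = model(img, txt)
loss = (nn.functional.cross_entropy(vqa_logits, torch.randint(0, 3129, (4,)))
        + nn.functional.cross_entropy(ic_logits, torch.randint(0, 30522, (4,))))
loss.backward()
```

Because both heads backpropagate through the same fusion module, the IC loss pressures the shared representation to encode visual evidence, which is the mechanism the abstract credits for improved image grounding.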
Pages: 1380-1394
Number of pages: 15